Abstract

We describe the first DNA-based storage architecture that enables random access to data blocks and rewriting of 
information stored at arbitrary locations within the blocks. The newly developed architecture overcomes drawbacks 
of existing read-only methods that require decoding the whole file in order to read one data fragment. Our system
is based on new constrained coding techniques and accompanying DNA editing methods that ensure data reliability, 
specificity and sensitivity of access, and at the same time provide exceptionally high data storage capacity. As a proof 
of concept, we encoded parts of the Wikipedia pages of six universities in the USA, and selected and edited parts of 
the text written in DNA corresponding to three of these schools. The results suggest that DNA is a versatile medium
suitable for both ultrahigh density archival and rewritable storage applications. 

Addressing the emerging demands for massive data repositories, and building upon the rapid development of technologies
for DNA synthesis and sequencing, a number of laboratories have recently outlined architectures for archival
DNA-based storage [1, 2, 3, 4]. The architecture in [3] achieved a storage density of 700 TB/gram, while the system
described in [4] raised the density to 2.2 PB/gram. The success of the latter method may be largely attributed to three
classical coding schemes: Huffman coding, differential coding, and single parity-check coding [4]. Huffman coding was
used for data compression, while differential coding was used for eliminating homopolymers (i.e., repeated consecutive
bases) in the DNA strings. Parity-checks were used to add controlled redundancy, which in conjunction with four-fold
coverage allows for mitigating assembly errors.¹

Due to dynamic changes in biotechnological systems, none of the three coding schemes represents a suitable solution 
from the perspective of current DNA sequencer designs: Huffman codes are fixed-to-variable length compressors that 
can lead to catastrophic error propagation in the presence of sequencing noise; the same is true of differential codes. 
Homopolymers do not represent a significant source of errors in Illumina sequencing platforms [6], while single parity
redundancy or RS codes and differential encoding are inadequate for combating error-inducing sequence patterns such as 
long substrings with high GC content [6]. As a result, assembly errors are likely, and were observed during the readout 
process described in [4].

An even more important issue that prohibits the practical widespread use of the schemes described in [3, 4] is that
accurate partial and random access to data is impossible, as one has to reconstruct the whole text in order to read or 
retrieve the information encoded even in a few bases. Furthermore, all current designs support read-only storage. The 
first limitation represents a significant drawback, as one usually needs to accommodate access to specific data sections; the 
second limitation prevents the use of current DNA storage methods in architectures that call for moderate data editing, 
for storing frequently updated information and memorizing the history of edits. Moving from a read-only to a rewritable 
DNA storage system requires a major implementation paradigm shift, as:

1. Editing in the compressive domain may require rewriting almost the whole information content;

2. Rewriting is complicated by the current DNA data storage format that involves reads of length 100 bps shifted by
25 bps so as to ensure four-fold coverage of the sequence (see Figure 1.1 (a) for an illustration and description of the data
format used in [4]). In order to rewrite one base, one needs to selectively access and modify four “consecutive” reads;

3. Addressing methods used in [3, 4] only allow for determining the position of a read in a file, but cannot ensure
precise selection of reads of interest, as undesired cross-hybridization between the primers and parts of the information
blocks may occur.

¹ Another class of DNA error-correcting schemes based on Reed-Solomon (RS) codes was recently reported in [5].

To overcome the aforementioned issues, we developed a new, random-access and rewritable DNA-based storage architecture
based on DNA sequences endowed with specialized address strings that may be used for selective information
access and encoding with inherent error-correction capabilities. The addresses are designed to be mutually uncorrelated
and to satisfy the error-control running digital sum constraint [7, 8]. Given the address sequences, encoding is performed
by stringing together properly terminated prefixes of the addresses as dictated by the information sequence. This encoding 
method represents a special form of prefix-synchronized coding [5]. Given that the addresses are chosen to be uncorrelated 
and at large Hamming distance from each other, it is highly unlikely for one address to be confused with another address 
or with another section of the encoded blocks. Furthermore, selection of the blocks to be rewritten is made possible 
by the prefix encoding format, while rewriting is performed via two DNA editing techniques, the gBlock and OE-PCR
(overlap-extension polymerase chain reaction) methods [10, 11]. With the latter method, rewriting is done in several steps
by using short and cheap primers. The first method is more efficient, but requires synthesizing longer and hence more
expensive primers. Both methods were tested on DNA-encoded Wikipedia entries of size 17 KB, corresponding to six
universities, where information in one, two and three blocks was rewritten in the DNA-encoded domain. The rewritten
blocks were selected, amplified and Sanger sequenced [?] to verify that selection and rewriting are performed with 100%
accuracy. 


1 Results

The main feature of our storage architecture that enables highly sensitive random access and accurate rewriting is addressing.
The rationale behind the proposed approach is that each block in a random access system must be equipped with
an address that will allow for unique selection and amplification via DNA sequence primers. 

Instead of storing blocks mimicking the structure and length of reads generated during high-throughput sequencing, 
we synthesized blocks of length 1000 bps tagged at both ends by specially designed address sequences. Adding addresses 
to short blocks of length 100 bps would incur a large storage overhead, while synthesizing blocks longer than 1000 bps 
using current technologies is prohibitively costly. 

More precisely, each data block of length 1000 bps was flanked at both ends by two unique, yet different, address blocks 
of length 20 bps. These addresses are used to provide specificity of access (see Figure 1.1 (b) and the Supplementary
Information for details). The remaining 960 bases in a block are divided into 12 sub-blocks of length 80 bps, with each
sub-block encoding six words of the text. The “word-encoding” process may be seen as a specialized compaction scheme suitable
for rewriting, and it operates as follows. First, different words in the text are counted and tabulated in a dictionary. Each 
word in the dictionary is converted into a binary sequence of length sufficiently long to allow for encoding of the dictionary. 
For our current implementation and texts of choice, described in the Supplementary Information section, this length was 
set to 24. Encodings of six consecutive words are subsequently grouped into binary sequences of length 144. The two-bit
string 11 is appended as a word marker to the left-hand side of each binary sequence of length 144, resulting in sequences of
length 146 bits. The binary sequences are subsequently translated into DNA blocks of length 80 bps using a new family
of DNA prefix-synchronized codes described in the Methods section. Our choice for the number of jointly encoded words 
is governed by the goal to make rewrites as straightforward as possible and to avoid error propagation due to variable 
codelengths. Furthermore, as most rewrites include words, rather than individual symbols, the word encoding method 
represents an efficient means for content update. Details regarding the counting and grouping procedure may be found in 
the Supplementary Information. 
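
The word-encoding step described above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary order (first appearance), the whitespace tokenization, and the padding of the last group with an empty word are hypothetical choices, since the actual counting and grouping procedure is given in the Supplementary Information. The final translation of each 146-bit string into an 80 bps sub-block is not shown.

```python
WORD_BITS = 24        # per-word binary length reported in the text
WORDS_PER_GROUP = 6   # six words are encoded jointly in one sub-block
MARKER = "11"         # two-bit word marker prepended to each 144-bit string


def build_dictionary(words):
    """Map each distinct word to an index (order of first appearance is an assumption)."""
    index = {}
    for w in words:
        index.setdefault(w, len(index))
    return index


def encode_groups(text, pad_word=""):
    """Return (dictionary, list of 146-bit strings), one string per group of six words."""
    words = text.split()
    # Hypothetical padding rule: fill the last group with an empty "word".
    while len(words) % WORDS_PER_GROUP:
        words.append(pad_word)
    index = build_dictionary(words)
    if len(index) > 2 ** WORD_BITS:
        raise ValueError("dictionary does not fit into 24-bit indices")
    groups = []
    for i in range(0, len(words), WORDS_PER_GROUP):
        chunk = words[i:i + WORDS_PER_GROUP]
        bits = "".join(format(index[w], f"0{WORD_BITS}b") for w in chunk)
        groups.append(MARKER + bits)
    return index, groups


index, groups = encode_groups("the quick brown fox jumps over the lazy dog")
print(len(groups), len(groups[0]))
```

Because every word occupies a fixed 24-bit slot, replacing one word changes only the sub-block that contains it, which is the property the fixed-length design is meant to provide.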

For three selected access queries, the 1000 bps blocks containing the desired information were identified via primers 
corresponding to their unique addresses, PCR amplified, Sanger sequenced, and subsequently decoded. 

Two methods were used for content rewriting. If the region to be rewritten had length exceeding several hundred bps, new
sequences with unique primers were synthesized as this solution represents a less costly alternative to rewriting. For the 
case that a relatively short substring of the encoded string had to be modified, the corresponding 1000 bps block hosting 
the string was identified and the changes were generated via DNA editing. 





Figure 1.1. (a) The scheme of [4] uses a storage format consisting of DNA strings that cover the encoded compressed
text in fragments of length 100 bps. The fragments overlap in 75 bps, thereby providing 4-fold coverage for all except
the flanking end bases. This particular fragmenting procedure prevents efficient file editing: If one were to rewrite the
“shaded” block, all four fragments containing this block would need to be selected and rewritten at different positions to
record the new “shaded” block. (b) The address sequence construction process using the notions of autocorrelation and
cross-correlation of sequences. A sequence is uncorrelated with itself if no proper prefix of the sequence is also a suffix
of the same sequence. Alternatively, no shift of the sequence overlaps with the sequence itself. Similarly, two different
sequences are uncorrelated if no prefix of one sequence matches a suffix of the other. Addresses are chosen to be mutually
uncorrelated, and each 1000 bps block is flanked by an address of length 20 on the left and by another address of length
20 on the right (colored ends). (c) Content rewriting via DNA editing: the gBlock method [10] for short rewrites, and the
cost efficient OE-PCR (Overlap Extension-PCR) method [11] for sequential rewriting of longer blocks.


Both the random access and rewriting protocols were tested experimentally on two jointly stored text files. One text 
file, of size 4 KB, contained the history of University of Illinois, Urbana-Champaign (UIUC) based on its Wikipedia entry 
retrieved on 12/15/2013. The other text file, of size 13 KB, contained the introductory Wikipedia entries of Berkeley, 
Harvard, MIT, Princeton, and Stanford, retrieved on 04/27/2014. 

Encoded information was converted into DNA blocks of length 1000 bps synthesized by IDT (Integrated DNA Technologies),
at a cost of $149 per 1000 bps (see http://www.idtdna.com/pages/products/genes/gblocks-gene-fragments).
The rewriting experiments encompassed: 

1. PCR selection and amplification of one 1000 bps sequence and simultaneous selection and amplification of three
1000 bps sequences in the pool. All 32 linear 1000 bps fragments were mixed, and the mixture was used as a template
for PCR amplification and selection. The results of amplification were verified by confirming sequence lengths of 1000
bps bands via gel electrophoresis (Figure 1.2 (a)) and by randomly sampling 3-5 sequences from the pools and Sanger
sequencing them (Figure 1.2 (b)). 

2. Experimental content rewriting via synthesis of edits located at various positions in the 1000 bps blocks. For
simplicity of notation, we refer to the blocks in the pool on which we performed selection and editing as B1, B2, and
B3. Two primers were synthesized for each rewrite in the blocks, for the forward and reverse direction. In addition, two
different editing/mutation techniques were used, gBlock and Overlap-Extension (OE) PCR. gBlocks are double-stranded
genomic fragments used as primers or for the purpose of genome editing, while OE-PCR is a variant of PCR used for
specific DNA sequence editing via point editing/mutations or splicing. To demonstrate the plausibility of a cost efficient
method for editing, OE-PCR was implemented with general primers (< 60 bps) only. Note that for edits shorter than 40
bps, the mutation sequences were designed as overhangs in primers. Then, the three PCR products were used as templates
for the final PCR reaction involving the entire 1000 bps rewrite. Figure 1.1 (c) illustrates the described rewriting process.
In addition, a summary of the experiments performed is provided in Table 1.
Sequence identifier - Editing method    # of sequence samples    Length of edits (bps)    Selection accuracy / error percentage
B1-M-gBlock                             5                        20                       (5/5) / 0%
B1-M-PCR                                5                        20                       (5/5) / 0%
B2-M-gBlock                             5                        28                       (5/5) / 0%
B2-M-PCR                                5                        28                       (5/5) / 0%
B3-M-gBlock                             5                        41 + 29                  (5/5) / 0%
B3-M-PCR                                5                        41 + 29                  (5/5) / 0%


Table 1. Selection, rewriting and sequencing results. Each rewritten 1000 bps sequence was ligated to a linearized 
pCR™-Blunt vector using the Zero Blunt PCR Cloning Kit and was transformed into E. coli. The E. coli strains with
correct plasmids were sequenced at ACGT, Inc. Sequencing was performed using two universal primers: M13F_20 (in 
the reverse direction) and M13R (in the forward direction) to ensure that the entire block of 1000 bps is covered. 



             Church et al. [3]             Goldman et al. [4]            Our scheme
Density      0.7 x 10^15 B/g               2.2 x 10^15 B/g               4.9 x 10^20 B/g
File size    5.27 MB                       739 KB                        17 KB
Cost         Not available                 $12,600                       $4,023
Features     Archival, no random-access    Archival, no random-access    Rewritable, random-access


Table 2. Comparison of storage densities for the DNA encoded information expressed in B/g (bytes per gram), file size, 
synthesis cost, and random access features of three known DNA storage technologies. Note that the density does not reflect 
the entropy of the information source, as the text files are encoded in ASCII format, which is a redundant representation 
system. 


Given that each nucleotide has weight roughly equal to 650 daltons ($650 \times 1.67 \times 10^{-24}$ grams), and given that
27,000 + 5,000 = 32,000 bps were needed to encode a file of size 13 + 4 = 17 KB in ASCII format, we estimate a potential
storage density of $4.9 \times 10^{20}$ B/g. This density significantly surpasses the current state-of-the-art storage density of
$2.2 \times 10^{15}$ B/g, as we avoid costly multiple coverage, use larger blocklengths and specialized word encoding schemes.
A performance comparison of the three currently known DNA-based storage media is given in Table 2. We observe that
the cost of sequence synthesis in our storage model is significantly higher than the corresponding cost of the prototype
in [4], as blocks of length 1000 bps are still difficult to synthesize. This trend is likely to change dramatically in the
near future, as within the last seven months, the cost of synthesizing 1000 bps blocks decreased almost 7-fold. Despite its
high cost, our system offers exceptionally large storage density, and for the first time, enables random access and content
rewriting features. Furthermore, although we used Sanger sequencing methods for our small scale experiment, for large
scale storage projects Next Generation Sequencing (NGS) technologies will enable significant reductions in readout costs.
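
The density estimate in this paragraph can be reproduced with a few lines of arithmetic. The sketch below assumes that 1 KB is counted as 1000 bytes, which is the reading that matches the reported figure; all constant names are illustrative.

```python
DALTON_IN_GRAMS = 1.67e-24      # mass of one dalton, as used in the text
NUCLEOTIDE_DALTONS = 650        # approximate weight of one nucleotide
TOTAL_BASES = 27_000 + 5_000    # bps used for the two text files
FILE_BYTES = (13 + 4) * 1000    # 17 KB, with 1 KB taken as 1000 bytes (assumption)


def storage_density(num_bytes, num_bases):
    """Bytes stored per gram of DNA for a single copy of the encoded data."""
    mass_in_grams = num_bases * NUCLEOTIDE_DALTONS * DALTON_IN_GRAMS
    return num_bytes / mass_in_grams


density = storage_density(FILE_BYTES, TOTAL_BASES)
print(f"{density:.2e} B/g")              # close to 4.9e20 B/g
print(f"{density / 2.2e15:.1e}")         # ratio to the 2.2e15 B/g reference value
```

The estimate counts a single copy of each block and ignores coverage, which is the sense in which the text calls it a potential density.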


2 Methods 

2.1 Address Design and Encoding 

To encode information on DNA media, we employed a two-step procedure. First, we designed address sequences of short 
length which satisfy a number of constraints that make them suitable for highly selective random access [13]. Constrained
coding ensures that DNA patterns prone to sequencing errors are avoided and that DNA blocks are accurately accessed, 
amplified and selected without perturbing or accidentally selecting other blocks in the DNA pool. The coding constraints 
apply to address primer design, but also indirectly govern the properties of the fully encoded DNA information blocks. 
The design procedure used is semi-analytical, insofar as it combines combinatorial methods with computer search
techniques. 

We required the address sequences to satisfy the following constraints: 

• (C1) Constant GC content (close to 50%) of all their prefixes of sufficiently long length. DNA strands with 50%
GC content are more stable than DNA strands with lower or higher GC content and have better coverage during
sequencing. Since encoding user information is accomplished via prefix-synchronization, it is important to impose
the GC content constraint on the addresses as well as their prefixes, as the latter requirement also ensures that all
fragments of encoded data blocks have balanced GC content.

• (C2) Large mutual Hamming distance, as it reduces the probability of erroneous address selection. Recall that the
Hamming distance between two strings of equal length equals the number of positions at which the corresponding
symbols disagree. An appropriate choice for the minimum Hamming distance is equal to half of the address sequence
length (10 bps in our current implementation which uses length 20 address primers).

• (C3) Uncorrelatedness of the addresses, which imposes the restriction that prefixes of one address do not appear as
suffixes of the same or another address and vice versa. The motivation for this new constraint comes from the fact that
addresses are used to provide unique identities for the blocks, and that their substrings should therefore not appear
in “similar form” within other addresses. Here, “similarity” is assessed in terms of hybridization affinity. Furthermore,
long undesired prefix-suffix matches may lead to read assembly errors in blocks during joint informational retrieval
and sequencing.

• (C4) Absence of secondary (folding) structures, as such structures may cause errors in the process of PCR amplification
and fragment rewriting.

Figure 1.2. (a) Gel electrophoresis results for three blocks, indicating that the length of the three selected and amplified
sequences is tightly concentrated around 1000 bps. (b) Output of the Sanger sequencer, where all bases shaded in yellow
correspond to correct readouts. The sequencing results confirmed that the desired sequences were selected, amplified, and
rewritten with 100% accuracy.

Addresses satisfying constraints C1-C2 may be constructed via error-correcting codes with small running digital sum [7]
adapted for the new storage system. Properties of these codes are discussed in Section 2.2. The novel notion of mutually
uncorrelated sequences is introduced in Section 2.3. Constructing addresses that simultaneously satisfy the constraints C1-C4
and determining bounds on the largest number of such sequences is prohibitively complex [?]. To mitigate this
problem, we resort to a semi-constructive address design approach, in which balanced error-correcting codes are designed
independently, and subsequently expurgated so as to identify a large set of mutually uncorrelated sequences. The resulting
sequences are subsequently tested for secondary structure using mfold and Vienna [?]. We conjecture that the number
of sequences satisfying C1-C4 grows exponentially with their length: proofs towards establishing this claim include results
on the exponential size of codes under each constraint individually.

Given two uncorrelated sequences as flanking addresses of one block, one of the sequences is selected to encode user
information via a new implementation of prefix-synchronized encoding [?], described in Section 2.4. The asymptotic rate of
an optimal single sequence prefix-free code is one. Hence, there is no asymptotic coding loss for avoiding prefixes of one
sequence; we only observe a minor coding loss for each finite-length block. For multiple sequences of arbitrary structure,
the problem of determining the optimal code rate is significantly more complicated and the rates have to be evaluated
numerically, by solving systems of linear equations [?] as described in Section 2.4 and the Supplementary Information. This
system of equations leads to a particularly simple form for the generating function of mutually uncorrelated sequences, as
explained in the Supplementary Information.
2.2 Balanced Codes and Running Digital Sums

An important criterion for selecting block addresses is to ensure that the corresponding DNA primer sequences have prefixes
with a GC content approximately equal to 50%, and that the sequences are at large pairwise Hamming distance. Due
to their applications in optical storage, codes that address related issues have been studied in a different form under the
name of bounded running digital sum (BRDS) codes [7, 8]. A detailed overview of this coding technique may be found
in [?].

Consider a sequence $a = a_0, a_1, a_2, \ldots, a_l, \ldots, a_n$ over the alphabet $\{-1, 1\}$. We refer to $S_l(a) = \sum_{i=0}^{l-1} a_i$ as the
running digital sum (RDS) of the sequence $a$ up to length $l$, $l \ge 0$. Let $D_a = \max\{|S_l(a)| : l \ge 0\}$ denote the largest
absolute value of the running digital sum of the sequence $a$. For some predetermined value $D > 0$, a set of sequences
$\{a(i)\}_{i=1}^{M}$ is termed a BRDS code with parameter $D$ if $D_{a(i)} \le D$ for all $i = 1, \ldots, M$. Note that one can define non-binary BRDS
codes in an equivalent manner, with the alphabet usually assumed to be symmetric, $\{-q, -q+1, \ldots, -1, 1, \ldots, q-1, q\}$,
and where $q \ge 1$. A set of DNA sequences over $\{A, T, G, C\}$ may be constructed in a straightforward manner by mapping
each $+1$ symbol into one of the bases $\{A, T\}$, and $-1$ into one of the bases $\{G, C\}$, or vice versa. Alternatively, one can use
BRDS codes over an alphabet of size four directly.

To address the constraints C1-C2, one needs to construct a large set of BRDS codewords at sufficiently large Hamming 
distance from each other. Via the mapping described above, these codewords may be subsequently translated to DNA 
sequences with a GC content approximately equal to 50% for all sequence prefixes, and at the same Hamming distance as 
the original sequences. 
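
A minimal sketch of the running digital sum and of the mapping to DNA described above. The rule used to pick a base within each pair (alternating by position) is an arbitrary assumption made for illustration; any choice within {A, T} or {G, C} leaves the GC balance of the prefixes unchanged.

```python
def running_digital_sums(seq):
    """Partial sums of a +1/-1 sequence, one value per nonempty prefix."""
    sums, total = [], 0
    for symbol in seq:
        total += symbol
        sums.append(total)
    return sums


def max_abs_rds(seq):
    """Largest absolute running digital sum (the quantity bounded by D)."""
    return max((abs(s) for s in running_digital_sums(seq)), default=0)


def to_dna(seq, at_pair="AT", gc_pair="GC"):
    """Map +1 to a base in {A, T} and -1 to a base in {G, C}.

    The base within each pair alternates with the position (assumption)."""
    return "".join((at_pair if s == 1 else gc_pair)[i % 2] for i, s in enumerate(seq))


def prefix_gc_imbalance(dna):
    """Largest |#GC - #AT| over all prefixes of a DNA string."""
    worst = balance = 0
    for base in dna:
        balance += 1 if base in "GC" else -1
        worst = max(worst, abs(balance))
    return worst


word = [1, -1, -1, 1, 1, -1, -1, 1]
dna = to_dna(word)
print(dna, max_abs_rds(word), prefix_gc_imbalance(dna))
```

A BRDS codeword with parameter D therefore yields a DNA string in which the numbers of G/C and A/T bases differ by at most D in every prefix, and two codewords at Hamming distance d map to DNA strings at distance d under a fixed mapping.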

Let $(n, C, d; D)$ be the parameters of a BRDS error-correcting code, where $C$ denotes the number of codewords of
length $n$, $d$ denotes the minimum distance of the code, while $\frac{\log C}{n}$ equals the code rate. For $D = 1$ and $d = 2$, the best
known BRDS code has parameters $(n, 2^{n/2}, 2; 1)$, while for $D = 2$ and $d = 1$, codes with parameters $(n, 3^{n/2}, 1; 2)$ exist. For
$D = 2$ and $d = 2$, the best known BRDS code has parameters $(n, 2 \cdot 3^{(n/2)-1}, 2; 2)$ [?]. Note that each of these codes has
an exponentially large number of codewords, among which a (sufficiently) large number of sequences satisfy the required
correlation property C3, discussed next, and the folding property C4. Codewords satisfying constraints C3-C4 were found
by expurgating the BRDS codes via computer search.


2.3 Sequence Correlation 


We describe next the notion of autocorrelation of a sequence and introduce the related notion of mutual correlation of 
sequences. 

It was shown in [?] that the autocorrelation function is the crucial mathematical concept for studying sequences
avoiding forbidden strings and substrings. In the storage context, forbidden strings correspond to the addresses of the 
blocks in the pool. In order to accommodate the need for selective retrieval of a DNA block without accidentally selecting 
any undesirable blocks, we find it necessary to also introduce the notion of mutually uncorrelated sequences. 

Let $X$ and $Y$ be two words, possibly of different lengths, over some alphabet of size $q > 1$. The correlation of $X$ and
$Y$, denoted by $X \circ Y$, is a binary string of the same length as $X$. The $i$-th bit (from the left) of $X \circ Y$ is determined
by placing $Y$ under $X$ so that the leftmost character of $Y$ is under the $i$-th character (from the left) of $X$, and checking
whether the characters in the overlapping segments of $X$ and $Y$ are identical. If they are identical, the $i$-th bit of $X \circ Y$ is
set to 1, otherwise, it is set to 0. For example, for X = CATCATC and Y = ATCATCGG, $X \circ Y = 0100100$, as depicted below.

    X = C A T C A T C
    Y = A T C A T C G G                0
          A T C A T C G G              1
            A T C A T C G G            0
              A T C A T C G G          0
                A T C A T C G G        1
                  A T C A T C G G      0
                    A T C A T C G G    0

Note that in general, $X \circ Y \ne Y \circ X$, and that the two correlation vectors may be of different lengths. In the example
above, we have $Y \circ X = 00000000$. The autocorrelation of a word $X$ equals $X \circ X$. In the example above, $X \circ X = 1001001$.
Definition 1. A sequence $X$ is self-uncorrelated if $X \circ X = 10 \ldots 0$. A set of sequences $\{X_1, X_2, \ldots, X_m\}$ is termed
mutually uncorrelated if each sequence is self-uncorrelated and if all pairs of distinct sequences satisfy $X_i \circ X_j = 0 \ldots 0$
and $X_j \circ X_i = 0 \ldots 0$.
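
The correlation operation and Definition 1 translate directly into code. The sketch below is an illustration with hypothetical function names; it reproduces the three correlation vectors of the CATCATC / ATCATCGG example.

```python
def correlation(x, y):
    """Correlation vector of x and y as a bit string of length len(x).

    Bit i (0-based) is 1 when y, placed with its first character under x[i],
    agrees with x on the whole overlapping segment."""
    bits = []
    for i in range(len(x)):
        overlap = min(len(x) - i, len(y))
        bits.append("1" if x[i:i + overlap] == y[:overlap] else "0")
    return "".join(bits)


def is_self_uncorrelated(x):
    """True when the autocorrelation of x equals 10...0."""
    return correlation(x, x) == "1" + "0" * (len(x) - 1)


def is_mutually_uncorrelated(seqs):
    """Check Definition 1 for a list of distinct sequences."""
    if not all(is_self_uncorrelated(s) for s in seqs):
        return False
    return all(
        "1" not in correlation(a, b)
        for i, a in enumerate(seqs)
        for j, b in enumerate(seqs)
        if i != j
    )


X, Y = "CATCATC", "ATCATCGG"
print(correlation(X, Y), correlation(Y, X), correlation(X, X))
```

Note that CATCATC itself is not self-uncorrelated, since its prefixes CATC and C reappear as suffixes; it serves only to illustrate the operation.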

Intuitively, correlation captures the extent to which prefixes of sequences overlap with suffixes of the same or other 
sequences. Furthermore, the notion of mutual uncorrelatedness may be relaxed by requiring that only sufficiently long 
prefixes do not match sufficiently long suffixes of other sequences. Sequences with this property, and at sufficiently large 
Hamming distance, eliminate undesired address cross-hybridization during selection and cross-sequence assembly errors. 

We proved the following bounds on the size of the largest mutually uncorrelated set of sequences of length $n$ over
an alphabet of size $q = 4$. The bounds show that there exist exponentially many mutually uncorrelated sequences for
any choice of $n$, and the lower bound is constructive. Furthermore, the construction used in the bound “preserves” the
Hamming distance (see the Supplementary Information).

Theorem 2. Suppose that $\{X_1, \ldots, X_m\}$ is a set of $m$ pairwise mutually uncorrelated sequences of length $n$. Let $u(n)$
denote the largest possible value of $m$ for a given $n$. Then

$$4 \cdot 3^{n/4} \le u(n) \le 9 \cdot 4^{n-2}.$$

As an illustration, for $n = 20$, the lower bound equals 972. The proof of the theorem is given in the Supplementary
Information.
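
As a quick check of both bounds for $n = 20$, assuming the lower bound has the form $4 \cdot 3^{n/4}$ (the reading that agrees with the value 972 quoted above):

```latex
\begin{align*}
4 \cdot 3^{20/4} &= 4 \cdot 3^{5} = 4 \cdot 243 = 972, \\
9 \cdot 4^{20-2} &= 9 \cdot 4^{18} = 9 \cdot 68{,}719{,}476{,}736 = 618{,}475{,}290{,}624 \approx 6.2 \times 10^{11}.
\end{align*}
```

The gap between the two values reflects that the bounds only constrain uncorrelatedness (C3), and say nothing about the remaining address constraints.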

It remains an open problem to determine the largest number of address sequences that jointly satisfy the constraints
C1-C4. We conjecture that the number of such sequences is exponential in n, as the numbers of words that satisfy C1-C2,
C3 and C4 [15] are exponential. Exponentially large families of address sequences are important indicators of the
scalability of the system and they also influence the rate of information encoding in DNA.

Using a casting of the address sequence design problem in terms of a simple and efficient greedy search procedure, we
were able to identify 1149 sequences for length n = 20 that satisfy constraints C1-C4, out of which 32 pairs were used for
block addressing. Another means to generate large sets of sequences satisfying the constraints is via approximate solvers
for the largest independent set problem [18]. Examples of sequences constructed in the aforementioned manner and used
in our experiments are listed in the Supplementary Information.
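
The greedy procedure itself is not spelled out in this section, so the following is only a hypothetical sketch of what such a search could look like: candidates are scanned in a fixed order and kept if they remain compatible with everything chosen so far. The GC test is a simple prefix-balance proxy for C1, the secondary-structure test C4 is omitted (it would require an external folding tool), and the parameters (length 8, distance 4, deviation 2) are toy values rather than those of the experiments.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def prefix_matches_suffix(x, y, start=0):
    """True if, for some shift i >= start, the suffix x[i:] equals a prefix of y."""
    return any(x[i:] == y[:len(x) - i] for i in range(start, len(x)))


def gc_balanced(seq, max_dev):
    """Proxy for C1: |#GC - #AT| stays within max_dev for every prefix."""
    balance = 0
    for base in seq:
        balance += 1 if base in "GC" else -1
        if abs(balance) > max_dev:
            return False
    return True


def greedy_addresses(candidates, min_dist, max_dev):
    """Keep each candidate that satisfies the C1 proxy, C2 and C3 against all kept ones."""
    chosen = []
    for c in candidates:
        if not gc_balanced(c, max_dev):
            continue
        if prefix_matches_suffix(c, c, start=1):      # self-correlated
            continue
        if all(hamming(c, a) >= min_dist
               and not prefix_matches_suffix(c, a)
               and not prefix_matches_suffix(a, c)
               for a in chosen):
            chosen.append(c)
    return chosen


candidates = ("".join(p) for p in product("ACGT", repeat=8))
addresses = greedy_addresses(candidates, min_dist=4, max_dev=2)
print(len(addresses), addresses[:3])
```

The result depends on the scan order, which is why the text also mentions independent-set solvers: compatibility between candidates defines a graph, and a large address set is a large independent set in the graph of conflicts.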

2.4 Prefix-Synchronized DNA Codes 

In the previous sections, we described how to construct address sequences that can serve as unique identifiers of the blocks 
they are associated with. We also pointed out that once such address sequences are identified, user information has to be 
encoded in order to avoid the appearance of any of the addresses, sufficiently long substrings of the addresses, or substrings 
similar to the addresses in the resulting DNA codeword blocks. For this purpose, we developed new prefix-synchronized 
encoding schemes based on [?].

To address the problem at hand, we start by introducing comma free and prefix-synchronized codes which allow for
constructing codewords that avoid address patterns. A block code $C$ comprising a set of codewords of length $N$ over an
alphabet of size $q$ is called comma free if and only if for any pair of not necessarily distinct codewords $a_1 a_2 \ldots a_N$ and
$b_1 b_2 \ldots b_N$ in $C$, the concatenations $a_2 a_3 \ldots a_N b_1$, $a_3 a_4 \ldots a_N b_1 b_2$, $\ldots$, $a_N b_1 \ldots b_{N-2} b_{N-1}$ are not in $C$. Comma free
codes enable efficient synchronization protocols, as one is able to determine the starting positions of codewords without
ambiguity. A major drawback of comma free codes is the need to implement an exhaustive search procedure over sequence
sets to decide whether or not a given string of length $N$ should be used as a codeword. This difficulty can be
overcome by using a special family of comma free codes, introduced by Gilbert [?] under the name prefix-synchronized
codes. Prefix-synchronized codes have the property that every codeword starts with a prefix $P = p_1 p_2 \ldots p_n$, which
is followed by a constrained sequence $c_1 c_2 \ldots c_s$. Moreover, for any codeword $p_1 p_2 \ldots p_n c_1 c_2 \ldots c_s$ of length $n + s$, the
prefix $P$ does not appear as a substring of $p_2 \ldots p_n c_1 c_2 \ldots c_s p_1 p_2 \ldots p_{n-1}$. More precisely, the constrained sequences of
prefix-synchronized codes avoid the pattern $P$ which is used as the address.

Due to the choice of mutually uncorrelated addresses at large Hamming distance, we encode each information block 
by avoiding only one of the address sequences, used for that particular block. 

To explain how to perform encoding, assume that $P = p_1 p_2 \ldots p_n \in \{A, T, G, C\}^n$ is a self-uncorrelated sequence. This
guarantees that $p_1 \ne p_n$. Without loss of generality, let $p_1 = A$ and $p_n = G$, and define

$$\bar{P}_i = \{A, C, T\} \setminus \{p_i\}, \qquad P^{(i)} = p_1 \ldots p_i,$$

for all $1 \le i \le n$. In addition, assume that the elements of $\bar{P}_i$ are arranged in increasing order, say using the lexicographical
ordering $A \prec C \prec T$. We subsequently use $\bar{p}_{i,j}$ to denote the $j$-th smallest element in $\bar{P}_i$, for $1 \le j \le |\bar{P}_i|$. For example, if
$\bar{P}_i = \{C, T\}$, then $\bar{p}_{i,1} = C$ and $\bar{p}_{i,2} = T$.

Next, we define a sequence of integers $G_{n,1}, G_{n,2}, \ldots$ that satisfies the following recursive formula

$$G_{n,\ell} = \begin{cases} 3^{\ell}, & 1 \le \ell < n, \\ \sum_{i=1}^{n-1} |\bar{P}_i| \, G_{n,\ell-i}, & \ell \ge n. \end{cases}$$

For an integer $\ell \ge 0$ and $y < 3^{\ell}$, let $\theta_\ell(y) \in \{A, T, C\}^{\ell}$ be a length-$\ell$ ternary representation of $y$. Conversely, for each
$W \in \{A, T, C\}^{\ell}$, let $\theta^{-1}(W)$ be the integer $y$ such that $\theta_\ell(y) = W$. Every integer $0 \le x < G_{n,\ell}$ can be mapped into a sequence
of $n + \ell$ symbols over $\{A, T, C, G\}$ via an encoding algorithm that consists of two parts: EncodePSC($P, \ell, x$) and CodePSC($P, \ell, x$).
Algorithm EncodePSC($P, \ell, x$) calls CodePSC($P, \ell, x$) and returns the concatenation of $P$ and CodePSC($P, \ell, x$).

The steps of the encoding procedure are listed in Algorithm 1, where $\mathcal{C}_\ell = \{\text{EncodePSC}(P, \ell, x) \mid 0 \le x < G_{n,\ell}\}$, and
where $n$ denotes the length of the sequence $P$. The decoding steps are described in the same chart.


Algorithm 1 Prefix-synchronized encoding and decoding

X = EncodePSC(P, \ell, x)
    return P CodePSC(P, \ell, x);

X = CodePSC(P, \ell, x)
begin
    n = length(P);
    if (\ell \ge n)
        t := 1;
        y := x;
        while (y \ge |\bar{P}_t| G_{n,\ell-t})
            y := y - |\bar{P}_t| G_{n,\ell-t};
            t := t + 1;
        end;
        a := \lfloor y / G_{n,\ell-t} \rfloor;
        b := mod(y, G_{n,\ell-t});
        return P^{(t-1)} \bar{p}_{t,a+1} CodePSC(P, \ell - t, b);
    else
        return \theta_\ell(x);
    end;
end;

x = DecodePSC(P, X)
begin
    n = length(P);
    \ell = length(X);
    X = X_1 X_2 \ldots X_\ell;
    if (\ell < n)
        return \theta^{-1}(X);
    else
        find (s, t) such that P^{(t-1)} \bar{p}_{t,s} = X_1 \ldots X_t;
        return (\sum_{i=1}^{t-1} |\bar{P}_i| G_{n,\ell-i}) + (s - 1) G_{n,\ell-t} + DecodePSC(P, X_{t+1} \ldots X_\ell);
    end;
end;

The following theorems are proved in the Supplementary Information. 

Theorem 3. $C_\ell$ is a prefix-synchronized code.

Theorem 4. The algorithm EncodePSC$(P, \ell, x)$ outputs a uniquely decodable string, for any $0 \le x < G_{n,\ell}$.

A simple example describing the encoding and decoding procedure for the short address string P = AGCTG, which can 
easily be verified to be self-uncorrelated, is provided in the Supplementary Information. 
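The recursion for $G_{n,\ell}$ and the two procedures of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names (`pbar`, `g_table`, `theta`, `code_psc`, `decode_psc`) are hypothetical, and the ternary digit order 0 -> A, 1 -> T, 2 -> C is an assumption inferred from the worked example in the Supplementary Information, where the integer 16 is written as ATCT.

```python
ORDER = "ACT"    # lexicographic order A < C < T used for the sets \bar{P}_i
DIGITS = "ATC"   # assumed ternary digit order: 0 -> A, 1 -> T, 2 -> C

def pbar(P, i):
    """Sorted symbols of {A, C, T} minus {p_i}; i is 1-indexed."""
    return [s for s in ORDER if s != P[i - 1]]

def g_table(P, max_len):
    """G_{n,l} for l = 1..max_len, following the recursive formula."""
    n, G = len(P), {}
    for l in range(1, max_len + 1):
        if l < n:
            G[l] = 3 ** l
        else:
            G[l] = sum(len(pbar(P, i)) * G[l - i] for i in range(1, n))
    return G

def theta(l, y):
    """Length-l ternary representation of y, most significant digit first."""
    out = []
    for _ in range(l):
        y, r = divmod(y, 3)
        out.append(DIGITS[r])
    return "".join(reversed(out))

def theta_inv(W):
    """Inverse of theta: the integer whose ternary representation is W."""
    y = 0
    for ch in W:
        y = 3 * y + DIGITS.index(ch)
    return y

def code_psc(P, l, x, G=None):
    """Map 0 <= x < G_{n,l} to a length-l string that avoids P."""
    n = len(P)
    G = G or g_table(P, l)
    if l < n:
        return theta(l, x)
    if not 0 <= x < G[l]:
        raise ValueError("x must satisfy 0 <= x < G_{n,l}")
    t, y = 1, x
    while y >= len(pbar(P, t)) * G[l - t]:
        y -= len(pbar(P, t)) * G[l - t]
        t += 1
    a, b = divmod(y, G[l - t])
    # P^{t-1}, then the (a+1)-th smallest symbol of \bar{P}_t, then recurse.
    return P[:t - 1] + pbar(P, t)[a] + code_psc(P, l - t, b, G)

def encode_psc(P, l, x):
    """Address P followed by the constrained sequence."""
    return P + code_psc(P, l, x)

def decode_psc(P, X, G=None):
    """Recover x from the constrained sequence X (address already removed)."""
    n, l = len(P), len(X)
    G = G or g_table(P, l)
    if l < n:
        return theta_inv(X)
    t = 1
    while X[t - 1] == P[t - 1]:   # first mismatch with P determines t
        t += 1
    s = pbar(P, t).index(X[t - 1]) + 1
    head = sum(len(pbar(P, i)) * G[l - i] for i in range(1, t))
    return head + (s - 1) * G[l - t] + decode_psc(P, X[t:], G)

P = "AGCTG"
print(code_psc(P, 8, 550))        # CCAAATCT
print(decode_psc(P, "CCAAATCT"))  # 550
```

For $P$ = AGCTG the sketch reproduces the values $(3, 9, 27, 81, 267, 849, 2715)$ and the codeword CCAAATCT for $x = 550$ reported in the Supplementary Information.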

The previously described EncodePSC$(P, \ell, x)$ algorithm imposes no limitations on the length of a prefix used for encoding.
This feature may lead to unwanted cross-hybridization between address primers used for selection and the prefixes of
addresses encoding the information. One approach to mitigate this problem is to “perturb” long prefixes in the encoded
information in a controlled manner. For small-scale random access/rewriting experiments, the recommended approach is
to first select all prefixes of length greater than some predefined threshold. Afterwards, the first and last quarter of the
bases of these long prefixes are used unchanged while the central portion of the prefix string is cyclically shifted by half
of its length. For example, for the address primer ACTAACTGTGCGACTGATGC, the prefix ACTAACTGTGCGACTG produced by
EncodePSC$(P, \ell, x)$ maps to ACTAATGCCTGGACTG. The process of shifting applied to this string is illustrated below:

ACTAA CTGTGC GACTG

cyclically shift the central portion by 3

ACTAA TGCCTG GACTG



For an arbitrary choice of the addresses, this scheme may not allow for unique decoding of the output of EncodePSC$(P, \ell, x)$. However,
there exist simple conditions that can be checked to eliminate primers that do not allow this transform to be “unique”.
Given the address primers created for our random access/rewriting experiments, we were able to uniquely map each
modified prefix to its original prefix and therefore uniquely decode the readouts.
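The perturbation of long prefixes can be sketched as follows. The function name `perturb_prefix` and its `flank` parameter are hypothetical: the text speaks of the first and last quarter of the bases, while the illustrated 16-base example keeps five bases at each end, so the number of unchanged bases is left as an explicit argument here.

```python
def perturb_prefix(prefix: str, flank: int) -> str:
    """Keep `flank` bases at each end; rotate the central part by half its length."""
    if 2 * flank > len(prefix):
        raise ValueError("the two unchanged flanks overlap")
    core = prefix[flank:len(prefix) - flank]
    shift = len(core) // 2
    rotated = core[shift:] + core[:shift]   # cyclic shift of the central portion
    return prefix[:flank] + rotated + prefix[len(prefix) - flank:]

original = "ACTAACTGTGCGACTG"
shifted = perturb_prefix(original, flank=5)
print(shifted)  # ACTAATGCCTGGACTG
```

When the central portion has even length, applying the same shift a second time restores the original prefix, which is one simple way to map a modified prefix back during decoding.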

As a final remark, we would like to point out that prefix-synchronized coding also supports error-detection and limited 
error-correction. Error-correction is achieved by checking if each substring of the sequence represents a prefix or “shifted” 
prefix of the given address sequence and making proper changes when needed. 

3 Discussion 

We described a new DNA-based storage architecture that enables accurate random access and cost-efficient rewriting. The
key component of our implementation is a new collection of coding schemes and the adaptation of random-access enabling
codes from classical storage systems. In particular, we encoded information within blocks with unique addresses that
are prohibited from appearing anywhere else in the encoded information, thereby removing any undesirable cross-hybridization
problems during the process of selection and amplification. We also performed four access and rewriting experiments
without readout errors, as confirmed by post-selection and rewriting Sanger sequencing. The current drawback of our
scheme is its high cost, as synthesizing long DNA blocks is expensive. Cost considerations also limited the scope of our
experiments and the size of the prototype, as we aimed to stay within a budget comparable to that used for other existing
architectures. Nevertheless, the benefits of random access and other unique features of the proposed system compensate
for this high cost, which we predict will decrease rapidly in the very near future.
1. Encoding Wikipedia Entries - A Working Example (Section 1).

2. Proofs of Theorems (Section 2).

6. Hybrid DNA-Based and Classical Storage (Section 6).


1 Encoding Wikipedia entries: A Working Example 


In this section we describe the data format used for encoding two files of size 17 KB containing the introductory sections
of Wikipedia pages of six universities: Berkeley, Harvard, MIT, Princeton, Stanford, and University of Illinois Urbana-Champaign.
There were 1,933 words in the text, out of which 842 were distinct. Note that in our context, words are
elements of the text separated by a space. For example, “university” and “university.” are counted as two different words,
while “Urbana-Champaign” is counted as a single word. These 1,933 words were mapped to $\lceil 1933 / 72 \rceil = 27$ DNA blocks of
length 1000 bps, as we grouped six words into fragments, and combined 12 fragments for prefix-synchronized encoding.

Table S1 provides the word counts in the files and encoding lengths (in bits) of the outlined procedure.

Assume that instead of using a prefix-synchronized code, we used classical ASCII encoding without compression to
encode the same Wikipedia pages. The total number of characters in the text equals 12,874, and each character is mapped
to a binary string of length 7. Hence, one would need $12874 \times 7 = 90118$ bits to represent the data, which is equivalent
to $\lceil 90118 / (2 \times 960) \rceil = 47$ DNA blocks of length 1000 bps if we set aside two unique address flags for the blocks. As one can see,
prefix-synchronized codes offer an almost 1.7-fold improvement in description length compared to ASCII encoding. This
comes at the cost of storing a larger dictionary, as one encodes words rather than symbols of the alphabet. For the working
example, one would require roughly 70-times larger dictionaries, as there are 1933 words with an average of 5.1 symbols
per word. This increase in the dictionary size is not a significant problem, as only one copy of the dictionary is ever needed.
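The block counts quoted above can be reproduced with a few lines of arithmetic. The sketch below rests on two assumptions that are not spelled out at this point of the text: each of the two address flags occupies 20 bases (the address length used in the Address Sequences section), and each remaining base of an ASCII-encoded block carries 2 bits.

```python
import math

# Word-based prefix-synchronized encoding: 6 words per fragment, 12 fragments per block.
words = 1933
words_per_block = 6 * 12
blocks_psc = math.ceil(words / words_per_block)

# Uncompressed 7-bit ASCII encoding of the same text.
characters = 12874
ascii_bits = characters * 7
payload_bases = 1000 - 2 * 20          # assumed: two 20-base address flags per block
bits_per_block = 2 * payload_bases     # assumed: 2 bits per base
blocks_ascii = math.ceil(ascii_bits / bits_per_block)

print(blocks_psc, blocks_ascii, round(blocks_ascii / blocks_psc, 2))  # 27 47 1.74
```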
              # symbols   # distinct symbols   # bits/distinct symbol   # bits
Characters    12874       51                   6                        77244
Words         1933        842                  12                       23196

Table S1. Comparison between character and word based encoding. Note that the number of bits per distinct symbol for
the word encoding case is computed as the ceiling of the logarithm of the number of distinct symbols plus one, where the
extra bit is used to prevent very small integers from being used in prefix-synchronized coding. Such integers may produce
long runs of the first symbol in the address, which should be avoided. Furthermore, to ensure fixed length encoding, and
hence avoid catastrophic error propagation, we doubled the number of bits used for encoding to 24.


2 Proofs of Theorems 

Proof of Theorem ^ The proof consists of two parts. First, we prove the upper bound on u (n) in Lemma 1, and then 
proceed to prove a lower bound in Lemma 2. Recall that u(n) denotes the largest possible size for a set of mutually 
uncorrelated words of length n. 

Lemma 1. Let $u(n)$ denote the size of the largest set of distinct mutually uncorrelated sequences of length $n$. Then

$$u(n) \le 9 \cdot 4^{n-2}.$$


Proof: To prove the lemma, let us introduce some terminology. Let $d_H(\cdot, \cdot)$ stand for the Hamming distance between
two words, and define the Hamming ball of radius $d$ around a point $W$ in $\{A, T, G, C\}^n$ as

$$B(W, d) = \{W' \in \{A, T, G, C\}^n : d_H(W, W') \le d\}.$$

Furthermore, let

$$C(W, d) = \{W' \in \{A, T, G, C\}^n : W' \in B(W, d), \; W', W \text{ are correlated}\}$$

denote the set of sequences correlated with $W$ that are also at most at Hamming distance $d$ from $W$.

We claim that for $n \ge d + 2 \ge 4$, one has

$$|C(W, d)| \ge 2 \sum_{i=0}^{d-1} \binom{n-1}{i} 3^i - \sum_{i=0}^{d-2} \binom{n-2}{i} 3^i. \qquad (2.1)$$


To prove the result, assume without loss of generality that $W$ starts with the symbol A, i.e., $W = A W_2 \ldots W_n$. Next,
consider two scenarios regarding the structure of $W = A W_2 \ldots W_n$:

• $W_n \neq A$: In this case, any word $W'$ in $B(W, d)$ that starts with $W_n$ or ends with A is an element of $C(W, d)$.
Let $S = \{W' : W' \in B(W, d), \; W' \text{ starts with } W_n\}$ and $E = \{W' : W' \in B(W, d), \; W' \text{ ends with A}\}$.
Clearly, $|S| = |E| = \sum_{i=0}^{d-1} \binom{n-1}{i} 3^i$ and $|S \cap E| = \sum_{i=0}^{d-2} \binom{n-2}{i} 3^i$. Therefore,
$|C(W, d)| \ge |S \cup E| = 2 \sum_{i=0}^{d-1} \binom{n-1}{i} 3^i - \sum_{i=0}^{d-2} \binom{n-2}{i} 3^i$.

• $W_n = A$: In this case, any word $W'$ in $B(W, d)$ which starts or ends with A is also an element of $C(W, d)$. Using an
argument similar to the one described for the previous scenario, one can show that
$|C(W, d)| \ge 2 \sum_{i=0}^{d} \binom{n-1}{i} 3^i - \sum_{i=0}^{d} \binom{n-2}{i} 3^i$.

Moreover, it is straightforward to see that

$$2 \sum_{i=0}^{d} \binom{n-1}{i} 3^i - \sum_{i=0}^{d} \binom{n-2}{i} 3^i \ge 2 \sum_{i=0}^{d-1} \binom{n-1}{i} 3^i - \sum_{i=0}^{d-2} \binom{n-2}{i} 3^i.$$
For any mutually uncorrelated set $\{X_1, \ldots, X_m\}$ of size $m$, we have $X_i \notin C(X_1, n)$, for $2 \le i \le m$. This implies that

$$\{X_1, \ldots, X_m\} \subseteq \{A, T, C, G\}^n \setminus C(X_1, n).$$

At the same time, the previous claim suggests that

$$|C(X_1, n)| \ge 2 \sum_{i=0}^{n-1} \binom{n-1}{i} 3^i - \sum_{i=0}^{n-2} \binom{n-2}{i} 3^i = 2 \cdot 4^{n-1} - 4^{n-2}.$$

Therefore, $m \le 4^n - (2 \cdot 4^{n-1} - 4^{n-2}) = 9 \cdot 4^{n-2}$, which completes the proof.

Lemma 2. Let $u(n)$ denote the size of the largest set of distinct mutually uncorrelated sequences of length $n$. Then

$$u(n) \ge 4 \cdot 3^{n/4}.$$


Proof: For simplicity, assume that $m$ is even. Given a mutually uncorrelated set $\{X_1, \ldots, X_m\}$, with words of
length $n$ and over the alphabet $\{A, T, G, C\}$, partition $\{X_1, \ldots, X_m\}$ into two arbitrary sets $\mathcal{A}$ and $\mathcal{B}$ of equal size, say
$\mathcal{A} = \{X_1, \ldots, X_{m/2}\}$ and $\mathcal{B} = \{X_{m/2+1}, \ldots, X_m\}$. We argue that $\mathcal{C} = \{XY \mid X \in \mathcal{A}, Y \in \mathcal{B}\}$ is a mutually uncorrelated
set with words of length $2n$.

• First, we show that the elements in $\mathcal{C}$ are self-uncorrelated: For an arbitrary element $Z \in \mathcal{C}$, we have $Z = XY$.
Since the two sequences $\{X, Y\}$ are mutually uncorrelated, one can easily verify that $Z_1^i \neq Z_{2n-i+1}^{2n}$, for
$i \in \{1, \ldots, 2n-1\} \setminus \{n\}$. Moreover, since $X \neq Y$, it holds that $Z_1^n \neq Z_{n+1}^{2n}$. This establishes the claim.

• Next, we argue that any two distinct elements in $\mathcal{C}$ are uncorrelated: For any two distinct elements $Z = XY$ and
$Z' = X'Y'$ in $\mathcal{C}$, one can show that $Z_1^i \neq (Z')_{2n-i+1}^{2n}$, for $i \in \{1, \ldots, 2n-1\} \setminus \{n\}$. In addition, $X \neq Y'$ implies that
$Z_1^n \neq (Z')_{n+1}^{2n}$. This completes the proof.

As a result, given a mutually uncorrelated set $\{X_1, \ldots, X_m\}$, where $X_i \in \{A, T, C, G\}^n$, one can construct another
mutually uncorrelated set $\{Z_1, \ldots, Z_{m^2/4}\}$, where $Z_i \in \{A, T, C, G\}^{2n}$. Therefore, $u(2n) \ge u(n)^2 / 4$. Observing that for $n = 4$
it is possible to construct the following set of 12 mutually uncorrelated sequences

{ATGC,ATAC,GTAC, GTGC 
ATTC,GTTC,AGGC,AAAC 
GAAC,GGGC,ATTT, GTTT} 

establishes the base of a recursive procedure which gives $u(n) \ge 4 \cdot (1.31)^n$. Note that this bound is constructive, and the
concatenation procedure preserves normalized minimum Hamming distances.
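The base set and one step of the concatenation procedure can be checked numerically. The sketch below uses hypothetical helper names and assumes the usual definition of correlation for equal-length words: two words are correlated when a proper prefix of one equals a suffix of the other, and a set is mutually uncorrelated when no pair of its words (a word paired with itself included) is correlated.

```python
from itertools import product

def correlated(x: str, y: str) -> bool:
    """True if some proper prefix of x equals a suffix of y."""
    return any(x[:k] == y[-k:] for k in range(1, min(len(x), len(y))))

def is_mutually_uncorrelated(words) -> bool:
    """Check all ordered pairs, including each word against itself."""
    return not any(correlated(x, y) for x, y in product(words, repeat=2))

def concatenate_halves(words):
    """Split the set into two halves A and B and return {XY : X in A, Y in B}."""
    half = len(words) // 2
    first, second = words[:half], words[half:2 * half]
    return [x + y for x in first for y in second]

BASE = ["ATGC", "ATAC", "GTAC", "GTGC",
        "ATTC", "GTTC", "AGGC", "AAAC",
        "GAAC", "GGGC", "ATTT", "GTTT"]

doubled = concatenate_halves(BASE)
print(is_mutually_uncorrelated(BASE))     # True
print(len(doubled), len(doubled[0]))      # 36 8
print(is_mutually_uncorrelated(doubled))  # True
```

Starting from $m = 12$ words of length 4, one step yields $m^2/4 = 36$ words of length 8, in agreement with $u(2n) \ge u(n)^2/4$.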

We now turn our attention to prefix-synchronized coding, and describe a number of results relevant for our subsequent 
discussion. 

Theorem 5 [17]. Given a positive integer $N$, choose the unique integer $n = n(N)$ so that $\beta = N 2^{-n}$ satisfies

$$\log 2 \le \beta < 2 \log 2.$$

Then, the maximal prefix-synchronized code of length $N$ has cardinality

$$N^{-1} 2^{N-1} \beta e^{-\beta} (1 + o(1)), \quad \text{as } N \to \infty,$$

for a prefix of the form $10 \ldots 0$.

Note that the above results indicate that codes avoiding one address sequence represent an exponentially large family
of binary sequences. We prove a similar result for the case of 4-ary sequences that avoid a set of $m$ mutually uncorrelated
sequences. To establish the claim, we need the following definitions. Let $g(0), g(1), \ldots,$ be an integer sequence over a
finite alphabet. Define the generating function of the sequence

$$G(z) = \sum_{N=0}^{\infty} g(N) z^{-N}.$$
Theorem 6. Suppose that $\{X_1, \ldots, X_m\}$ is a set of mutually uncorrelated sequences of length $n$ over the alphabet
$\{A, T, C, G\}$. Let $f(N)$, with $f(0) = 1$, be the number of strings of length $N$ over $\{A, T, C, G\}$ that do not contain
substrings in $\{X_1, \ldots, X_m\}$. Then

$$F(z) = \frac{z^n}{m + (z - 4) z^{n-1}},$$

where $F(z)$ is the generating function of the sequence $\{f(N)\}$.

Proof of Theorem 6. The result is a direct consequence of Theorem 4.1 of im. For $1 \le i \le m$, let $f_i(N)$ denote the
number of strings of length $N$ over $\{A, T, C, G\}$ that contain no element of $\{X_1, \ldots, X_m\}$, except for a single copy of $X_i$ at
the right-hand side of the string. Let $F_i(z)$ be the generating function of $f_i(N)$. Then, we have the following system of
equations that holds for the two sets of aforementioned functions:

$$\begin{aligned}
(z - 4) F(z) + z F_1(z) + \ldots + z F_m(z) &= z \\
F(z) - z (X_1 \circ X_1)_z F_1(z) - z (X_2 \circ X_1)_z F_2(z) - \ldots - z (X_m \circ X_1)_z F_m(z) &= 0 \\
&\;\;\vdots \\
F(z) - z (X_1 \circ X_m)_z F_1(z) - z (X_2 \circ X_m)_z F_2(z) - \ldots - z (X_m \circ X_m)_z F_m(z) &= 0
\end{aligned} \qquad (2.2)$$

By using the fact that $(X_i \circ X_i)_z = z^{n-1}$, for $1 \le i \le m$, and $(X_i \circ X_j)_z = 0$, for $1 \le i \neq j \le m$, one can show that

$$F(z) = z^n F_1(z) = \ldots = z^n F_m(z). \qquad (2.3)$$

The result follows by substituting (2.3) into the first line of (2.2).

As the dominant pole of the generating function is close to 4, the number of sequences avoiding a set of mutually
uncorrelated sequences grows roughly as $4^N$.
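Multiplying out the expression for $F(z)$ in Theorem 6 and comparing coefficients gives the recurrence $f(N) = 4 f(N-1) - m f(N-n)$, with $f(0) = 1$ and $f(k) = 0$ for $k < 0$. The following sketch, with hypothetical function names, compares this recurrence against brute-force enumeration for three of the length-4 sequences listed in the proof of Lemma 2; the choice of these three words and of the maximum length 8 is only for illustration.

```python
from itertools import product

def count_avoiding_recurrence(m: int, n: int, max_len: int):
    """f(N) for N = 0..max_len from f(N) = 4 f(N-1) - m f(N-n)."""
    f = [0] * (max_len + 1)
    f[0] = 1
    for N in range(1, max_len + 1):
        f[N] = 4 * f[N - 1] - (m * f[N - n] if N >= n else 0)
    return f

def count_avoiding_bruteforce(words, N: int) -> int:
    """Enumerate all 4^N strings and count those containing none of the words."""
    total = 0
    for symbols in product("ATCG", repeat=N):
        s = "".join(symbols)
        if not any(w in s for w in words):
            total += 1
    return total

words = ["ATGC", "ATAC", "GTAC"]   # mutually uncorrelated, n = 4, m = 3
f = count_avoiding_recurrence(m=len(words), n=4, max_len=8)
for N in range(9):
    print(N, f[N], count_avoiding_bruteforce(words, N), round(f[N] / 4 ** N, 4))
```

The last printed column is the fraction of all $4^N$ strings that avoid the set; it shrinks slowly with $N$, which reflects a dominant pole slightly below 4.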

Proof of Theorem 3. Since $P$ is self-uncorrelated, we need to show that this string is not contained in the output of
CodePSC$(P, \ell, x)$, where the output of CodePSC$(P, \ell, x)$ equals

$$\text{CodePSC}(P, \ell, x) = P^{t_1 - 1} \bar{p}_{t_1, s_1} \cdots P^{t_r - 1} \bar{p}_{t_r, s_r} \, \theta_{t_0}(\cdot),$$

for some input of $\theta_{t_0}(\cdot)$, and $1 \le t_0, t_1, \ldots, t_r \le n$. Consequently, if $P$ is a substring of the output of CodePSC$(P, \ell, x)$, then
the last symbol of $P$ (recall that we assumed this symbol to be G) has to appear in one of the following three positions:

• The symbol appears in $P^{t_i - 1}$, for a unique $1 \le i \le r$: In this case, there exists a suffix of $P$ appearing as a prefix
of $P^{t_i - 1}$. This contradicts our assumption that $P$ is self-uncorrelated.

• The symbol appears in $\bar{p}_{t_i, s_i}$, for a unique $1 \le i \le r$: This contradicts our assumption that $\bar{p}_{t_i, s_i} \neq G$.

• The symbol appears in $\theta_{t_0}(\cdot)$: This contradicts our assumption that G does not appear in $\theta_{t_0}(\cdot) \in \{A, T, C\}^{t_0}$.

Therefore, the string $P$ does not appear as a substring in the output of CodePSC$(P, \ell, x)$, which completes the proof.

Proof of Theorem 4. It suffices to show that the output of CodePSC$(P, \ell, x)$ is uniquely decodable. We use induction
arguments to establish this result. For the basis step, by the definition of the output of CodePSC, it is straightforward
to show that CodePSC$(P, \ell, x)$ returns the encoding $\theta_\ell(x)$, which represents a one-to-one mapping from $0 \le x < 3^{\ell}$ to
$\{A, T, C\}^{\ell}$ whenever $\ell < n$. For the inductive step, we assume that the result is true for all $\ell < r$, where $r \ge n$,
and show that it is consequently true for $\ell = r$.

For $\ell = r$, CodePSC$(P, \ell, x)$ returns

$$P^{t-1} \, \bar{p}_{t,s} \, \text{CodePSC}(P, \ell - t, b),$$

for some integer values $s, b$ and for some $1 \le t \le n$, where $x = \left(\sum_{i=1}^{t-1} |\bar{P}_i| \, G_{n,\ell-i}\right) + (s - 1) \, G_{n,\ell-t} + b$. Therefore $x$
is uniquely decodable if and only if $s$, $t$ and $b$ are unique. Since sequences of the form $P^{t-1} \bar{p}_{t,s}$ are prefix-free, one can
uniquely identify both $t$ and $s$. Moreover, $\ell - t < r$, hence by the induction hypothesis it follows that $b$ is also uniquely
decodable from CodePSC$(P, \ell - t, b)$. Hence, $x$ can be uniquely decoded.
Designation of primer   Sequence
B1-forward              5'AATTACTAAGCGACCTTCTC3'
B1-reverse              5'ACTTATTGCGACTTCTAAGG3'
gBlock-B1-reverse       5'CTTCATAACAACTAACTGTGAC3'
B1-SU1-reverse          5'CGTGCACTCATAACCCATATTTCAAGAGCTAGCTATTCCTCTCCCTTAAAAGTAAATGAC3'
B1-SD1-forward          5'GGGAGAGGAATAGCTAGCTCTTGAAATATGGGTTATGAGTGCACGATCATCACATAAC3'
B2-forward              5'AACCTAACCATCTTCCTCTC3'
B2-reverse              5'AAACGATCCCCTGACAGAGC3'
gBlock-B2-forward       5'GAAGCACAGTGTTGCTGCGTG3'
B2-SU1-reverse          5'CAGCTTGTATCCCATCTCAACCCTAATTCCATAACCGTCAGCGCAGTTGACTAGTCTC3'
B2-SD1-forward          5'CTGCGCTGACGGTTATGGAATTAGGGTTGAGATGGGATACAAGCTGATATGGGAAC3'
B3-forward              5'ATAATAGGCCTGATGATCTC3'
B3-reverse              5'AAGAAGAACCAGTAAGCAGC3'
B3-SU1-reverse          5'AACATCTACTCACTCTCAATCTAAGCTTGAACTGTGTACACACCATCGCTCTTGTACGCC3'
B3-SU2-forward          5'GTGTACACAGTTCAAGCTTAGATTGAGAGTGAGTAGATGTTGATGCGAGGCGAAAGATGT3'
B3-SD2-reverse          5'GACTTCCCCCCTATAATCCATTAATGCTAGATCAAGCCGCATATACTATGTTGCAAATAC3'
B3-SD2-forward          5'GCGGCTTGATCTAGCATTAATGGATTATAGGGGGGAAGTCGCTGCTGGTACTCTG3'


Table S2. List of primers for rewriting (editing) the blocks B1, B2 and B3. The primers for the gBlock method are listed
separately from those used with the OE-PCR method. In the latter case, the labels of DNA fragments SU and SD stand for
sample upstream and sample downstream. In OE-PCR, we linked two DNA fragments or three DNA fragments into the
final PCR products; when two fragments were linked, the first fragment was labeled UP (U), while the second fragment
was labeled DOWN (D); when three fragments were combined, the second fragment was labeled MIDDLE (M).

3 Address Sequences

Consider the following set of strings of length 20, 

ACTAACTGTGCGACTGATGC 

ACACTATCGAGCTGACACGT 

AGTCAGCAGTAGTCAGTCAG 

ACTGAGCTGAGCGTATATCG 

ACTCAGCTACGACTCACATG 

with GC content equal to 50%, i.e., 10 GC bases. The sequences are mutually uncorrelated and at Hamming distance 
exactly 10 from each other. The sequences do not exhibit secondary structures at room temperature, as verified by the 
mfold and Vienna packages. We used these addresses for a very small-scale, proof-of-concept random access/rewriting 
experiment of a 4 KB file. 

In the large scale random access/rewriting experiment described in Section [5] we used different address sequences for 
the two flanking ends of the 1000 bps blocks. The sequences we synthesized include: 


block 1: (CTCTTCCAGCGAATCATTAA, ACTTATTGCGACTTCTAAGG) 
block 2: (CTCTCCTTCTACCAATCCAA,AAACGATCCCCTGACAGAGC) 
block 3: (CTCTAGTAGTCCGGATAATA,AAGAAGAACCAGTAAGCAGC) 
block 4: (CTCTTTCGCTGTGCACAAAA,AAATCGGAAATTCGTGTCGC) 
block 5: (CTCTGCTGGAAATGTGTGAA,AATTCACGGTCCGAAACACC) 
block 6: (CTCTGTTCCTCCTTTCTCGT,TGTAGACGATTTGATTGGCG) 
block 7: (CTCTAGCAACTTCCGCAAAT,ACGAGATTCATACCGGACCC) 
block 8: (CTCTAGCTTCCCTATCCATA,TGCAGAAGAGGAGTGTCAGC) 
block 9: (CTCTATAGGCTCTGGTATGT,TTTAACCCGCCCGTACAGCC) 
block 10: (CTCTCGCTCATCTCATGTTT,ACAGTACTTGCCCAATTCGC) 
block 11: (CTCTGTACTCCGCTGAATCA,TAAACATTACAAGCCCCTCG) 
block 12: (CTCTTCTTCCCTGACGATGT,AATACAACTTCTAACCACCC) 
block 13: (CTCTTGATCCTACTGAGAAA, TTAATAGTTCCCGGCAGCCC) 
block 14: (CTCTAGTGACGTGACAGGTA,TTAGAACGAACCAGTATAGC) 
block 15: (CTCTACCTAAGGCCTTTGAA,TTGACCCATGAGCCAGCACC) 
block 16: (CTCTACAGTAGTAAACTCGT,TGCTGAACTCTAATCTGTCC) 
block 17: (CTCTGGGCGGCTGTACACAA,ATACACTCATAACACCTCGG) 
block 18: (CTCTGCGATCACAAAAAGTT,ACAACTATACGTGTCGGACC) 
block 19: (CTCTTTAGCACGAGTCCTAT,TGAACCCGTCGTGCTAATCG) 
block 20: (CTCTAATACGCACGCCCATT, ATACGGGATACAATTAGGGC) 
block 21: (CTCTGAGGCGTGGATATTTT,AATACATCCCTAAAAGCCGG) 
block 22: (CTCTGCGTGTTCATTCCATT,TGAGGATAGGATTAGTAAGG) 
block 23: (CTCTAAGAATCTGACTGCAT,ATGTTAACACTGAGTAAGGG) 
block 24: (CTCTGATCGAACCCATGTCA,ACATGACCTACATAACGTCC) 
block 25: (CTCTCTGGTGGCCTAAAAAT,AACAGAGATCAGAGCAGTGG) 
block 26: (CTCTAGAGAAACGTTGAAGT,AACCCGTACTCACTATGCCG) 
block 27: (CTCTGACGTCTACACAACAT,TTTGTAGATCCCAAGCATCG) 

The pairs of sequences were used to flank the two ends of the data blocks. Only the addresses on the left were used 
for subsequent prefix-synchronized coding. 
The sequences on the left-hand side of the pairing have “interleaved” {G, C} and {A, T} bases - for example, they all
start with CTCT.... This ensures a “GC balancing” property for the prefixes of the addresses.
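One way to inspect the “GC balancing” of address prefixes is to track the running number of G/C bases along a sequence. The sketch below is only an illustration: the helper name `gc_prefix_counts` is hypothetical, and measuring balance through running GC counts is an assumption about what the property means, not a procedure stated in the text.

```python
def gc_prefix_counts(seq: str):
    """Number of G or C bases in each prefix seq[:1], seq[:2], ..., seq."""
    counts, total = [], 0
    for base in seq:
        total += base in "GC"   # True counts as 1
        counts.append(total)
    return counts

address = "CTCTTCCAGCGAATCATTAA"   # left-hand address of block 1
for length, gc in enumerate(gc_prefix_counts(address), start=1):
    print(length, gc, round(gc / length, 2))
```

For the common start CTCT the running counts are 1, 1, 2, 2, so every even-length prefix of that stretch has a GC content of exactly one half.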

4 Encoding and Decoding Example 

In this section, we illustrate the encoding and decoding procedure for the short address string P = AGCTG, which can 
easily be verified to be self-uncorrelated. 

More precisely, we explain how to compute a sequence of integers $G_{n,1}, G_{n,2}, \ldots, G_{n,7}$, described in the main body of
the paper. As before, $n$ denotes the length of the address string, which in this case equals five.

One has

$$(G_{n,1}, G_{n,2}, \ldots, G_{n,7}) = (3, 9, 27, 81, 267, 849, 2715).$$

The algorithm CodePSC(P, 8, 550) produces:

$550 = 0 \times G_{5,7} + 550$
$\Rightarrow$ CodePSC(P, 8, 550) = C CodePSC(P, 7, 550),

$550 = 0 \times G_{5,6} + 550$
$\Rightarrow$ CodePSC(P, 7, 550) = C CodePSC(P, 6, 550),

$550 = 2 \times G_{5,5} + 0 \times G_{5,4} + 16$
$\Rightarrow$ CodePSC(P, 6, 550) = AA CodePSC(P, 4, 16),

$16 = 0 \times 3^3 + 1 \times 3^2 + 2 \times 3^1 + 1 \times 3^0$
$\Rightarrow$ CodePSC(P, 4, 16) = ATCT,

$\Rightarrow$ CodePSC(P, 8, 550) = CCAAATCT.

When running DecodePSC(P, X) on the encoded output X = CCAAATCT, the following steps are executed:

$\Rightarrow$ DecodePSC(P, CCAAATCT) = $0 \times G_{5,7}$ + DecodePSC(P, CAAATCT),

$\Rightarrow$ DecodePSC(P, CAAATCT) = $0 \times G_{5,6}$ + DecodePSC(P, AAATCT),

$\Rightarrow$ DecodePSC(P, AAATCT) = $2 \times G_{5,5} + 0 \times G_{5,4}$ + DecodePSC(P, ATCT),

$\Rightarrow$ DecodePSC(P, ATCT) = 16,

$\Rightarrow$ DecodePSC(P, CCAAATCT) = $2 \times G_{5,5} + 16 = 550$.

5 Experimental Synthesis, Access and Rewrite of DNA Sequences 

A total of 27 sequences of length 1000 bps each were designed to encode information retrieved from the Berkeley, Harvard,
MIT, Princeton, Stanford, and UIUC Wikipedia pages in 2014. Except for sequence #4, which was rejected due to the
complexity of its secondary structure, all sequences were synthesized by IDT (Integrated DNA Technologies). In addition,
27 corresponding address primers were synthesized by the same company. The address sequences of the blocks are listed
in Section 3.

As a proof of concept, we performed a number of selection and editing experiments. These include selecting individual
blocks and rewriting one of their sections, and selecting three blocks and rewriting three sections in each: two close to the
flanking ends, and one in the middle. The edits involved information about the budget of the institutions at a given
year of operation. Detailed information about the original sequences and their rewritten forms is given in the following
sections.








Sequence       Number of          Length of the edited   Selection accuracy /        Description of
identifier     sequence samples   region (in bps)        readout error percentage    editing method
B1-M-gBlock    5                  20                     5/5 / 0%                    gBlock method
B1-M-PCR       5                  20                     5/5 / 0%                    OE-PCR method
B2-M-gBlock    5                  28                     5/5 / 0%                    gBlock method
B2-M-PCR       5                  28                     5/5 / 0%                    OE-PCR method
B3-M-gBlock    5                  41 + 29                5/5 / 0%                    gBlock method
B3-M-PCR       5                  41 + 29                5/5 / 0%                    OE-PCR method


Table S3. Selection, rewriting and sequencing results. Each rewritten 1000 bps sequence was ligated to a linearized
pCR™-Blunt vector using the Zero Blunt PCR Cloning Kit and was transformed into E. coli. The E. coli strains with
correct plasmids were sequenced at ACGT, Inc. Sequencing was performed using two universal primers: M13F_20 (in
the reverse direction) and M13R (in the forward direction) to ensure that the entire blocks of 1000 bps are covered.

[Figure S1, panels A and B. Labels in the figure: PCR amplified fragment which has 97 bps overlap with gBlock fragment; gBlock of 250 bps sequence containing the entire mutation region (80 bp); OE-PCR amplification of the entire mutation; Final PCR method.]

Fig. S1. A) Schematic depiction of the editing method using gBlocks. B) Detailed description of the generation of the
mutation. Four sequences (ranging in length from 177 to 588 bps) containing the entire edit region were gBlock synthesized
by IDT. The remaining parts of the 1000 bps sequences were PCR amplified. A homology of at least 30 bps between
the flanking end sequence of the blocks and the corresponding end of the gBlock fragment was created. By one OE-PCR,
the desired edits were generated in a one-pot manner.


We denoted the blocks on which we performed selection and editing by B1, B2, and B3. The primers used for performing
the edits in the blocks are listed in Table S2. Note that two primers were synthesized for each rewrite, for the forward
and reverse direction. In addition, two different editing (mutation) techniques were used, gBlock and Overlap-Extension
(OE) PCR; gBlocks are double-stranded genomic fragments that are frequently used as primers, for gene construction or
for mediated genome editing. An illustration of editing via gBlocks is shown in Fig. S1. On the other hand, OE-PCR is a
variant of PCR used for specific DNA sequence editing via point mutations or splicing. An illustration of the procedure
is given in Fig. [ST]. To demonstrate the plausibility of a cost-efficient method for editing, OE-PCR was used with general
primers (< 60 bps) only. For edits shorter than 40 bps, the mutation sequences were designed as overhangs in primers.
Then, the three PCR products were used as templates for the final PCR reaction involving the entire 1000 bps rewrite.

All 27 linear 1000 bps fragments were mixed, and the mixture was used as a template for PCR amplification and
selection of the B1, B2 and B3 sequences. The results of selection are shown in Fig. S2, where three bands of size 1000 bps
are depicted. These bands indicate that sequences of the correct length were isolated. Subsequent sequencing confirmed
that the sequences were indeed the user-requested B1, B2 and B3 strands. A summary of the experiments performed is
provided in Table S3.
Fig. S2. PCR of the 1000 bps sequences B1, B2, and B3 from a mixture of 26 sequences.

5.1 B1 mutation B1-M synthesis

The unedited B1_original (B1) sequence is of the form:

AATTACTAAGCGACCTTCTCGGATAGAACGCTTAGTTGGTGCGTTGACAT 

GCTCGAACTGATCATCGGTCACTTGCATTCATTATTGATTGTTGAGTTGA 

GAAGCGCATTGGTGTCACTCGTTGCTGGGTCATTTTCGGCGAGAGAAACA 

GTTCACTGTGGCGTGATGTTTTGAAATGAGGGAGAGTTCTCTTAACTGCA 

GTTGGAGTTCAGTATACTCGGGATAGTGTAACAGAGGGAGGCGGATGTGT 

GTATTGATGTGAAGTCTTTCACGTGCGGGCTAGGTCGTAATGACGGGTCG 

GGAACTATTCATTGGCGCAATAGTGATTTTGATGAATGATGGATAGAACG 

CTTAAAGGGAAACTATATAGTTCAAAGCTCGTCGGCGGTGTCGAGGATGT 

ATAGGGGTTAATGAATGGTGGAACTTACTTATACTATAGATTGGACTGGT 

GGTATGAGAACTTCACTAATTATTGACGTCACAGTTAGTTGTTATGAAGT 

GATAATATGAATCGAGCGCAACAGGACTAGTCATTTACTTTTAAGGGAGA 

GGAATAGCTAATCTCAAATTTTTTTTATGTGAGTGCACGATCATCACATA 

ACATAGGAGGCGATGAGACAGCGACTCAATCTGACTAATTCATTATAGGA 

GTTATATGAAGAGTTCGGAACGAAGCTAGCGCTTTCGCACAATGCGAGGG 

ATAAGAGCGGGTGCAGAGCGAAGGGTGTGAAATTGATGGTGGATAAGAAC 

TTCGCACAGTACTAGCTAGTGGGGAGAGACTTCTATGAATTCGGAGGGAT 

ACTTGATATTGATATGGGGGGATGGCGCTATTAAGCGCAGAGCGTAAGTG 

CGCTTCAAATCGAACATTGTGTAGCTAAGCAATAGAGAAATGTGGGGATT 

GAGCAGTTCGTATCGGTTCGCATGACATACTTGGGAAAATGGCAGCTTGT 

TTAAGCTAAACTGGATGAAAGGGAGGAAAAACTTATTGCGACTTCTAAGG 

where the bases written in red represent the regions we edited. 
Fig. S3. Illustration of the process of generating the B1 edit/mutation using general primers.


The edited B1_mutation (B1_M) sequence reads as:

AATTACTAAGCGACCTTCTCGGATAGAACGCTTAGTTGGTGCGTTGACAT 

GCTCGAACTGATCATCGGTCACTTGCATTCATTATTGATTGTTGAGTTGA 

GAAGCGCATTGGTGTCACTCGTTGCTGGGTCATTTTCGGCGAGAGAAACA 

GTTCACTGTGGCGTGATGTTTTGAAATGAGGGAGAGTTCTCTTAACTGCA 

GTTGGAGTTCAGTATACTCGGGATAGTGTAACAGAGGGAGGCGGATGTGT 

GTATTGATGTGAAGTCTTTCACGTGCGGGCTAGGTCGTAATGACGGGTCG 

GGAACTATTCATTGGCGCAATAGTGATTTTGATGAATGATGGATAGAACG 

CTTAAAGGGAAACTATATAGTTCAAAGCTCGTCGGCGGTGTCGAGGATGT 

ATAGGGGTTAATGAATGGTGGAACTTACTTATACTATAGATTGGACTGGT 

GGTATGAGAACTTCACTAATTATTGACGTCACAGTTAGTTGTTATGAAGT 

GATAATATGAATCGAGCGCAACAGGACTAGTCATTTACTTTTAAGGGAGA 

GGAATAGCTAGCTCTTGAAATATGGGTTATGAGTGCACGATCATCACATA 

ACATAGGAGGCGATGAGACAGCGACTCAATCTGACTAATTCATTATAGGA 

GTTATATGAAGAGTTCGGAACGAAGCTAGCGCTTTCGCACAATGCGAGGG 

ATAAGAGCGGGTGCAGAGCGAAGGGTGTGAAATTGATGGTGGATAAGAAC 

TTCGCACAGTACTAGCTAGTGGGGAGAGACTTCTATGAATTCGGAGGGAT 

ACTTGATATTGATATGGGGGGATGGCGCTATTAAGCGCAGAGCGTAAGTG 

CGCTTCAAATCGAACATTGTGTAGCTAAGCAATAGAGAAATGTGGGGATT 

GAGCAGTTCGTATCGGTTCGCATGACATACTTGGGAAAATGGCAGCTTGT 

TTAAGCTAAACTGGATGAAAGGGAGGAAAAACTTATTGCGACTTCTAAGG 


with rewrites listed in red. 

5.1.1 The gBlock method 

Since a gBlock of length longer than 500 bps was needed, it was more costly to synthesize the gBlock and perform rewriting
than to directly re-synthesize the whole block. Hence, the gBlock method was not used in this case.


5.1.2 The OE-PCR based method

One pair of primers was designed to PCR amplify the first portion of the sequence B1-M. For the forward direction,
the primer was

5’AATTACTAAGCGACCTTCTC3’

while for the reverse direction, the primer was

5’CGTGCACTCATAACCCATATTTCAAGAGCTAGCTATTCCTCTCCCTTAAAAGTAAATGAC3’.

The second part of the sequence was PCR amplified by using the forward direction primer

5’GGGAGAGGAATAGCTAGCTCTTGAAATATGGGTTATGAGTGCACGATCATCACATAAC3’

and reverse direction primer

5’ACTTATTGCGACTTCTAAGG3’.

Both PCR reactions used the sequence B1 as template. Two such PCR products are shown in Fig. S4, indicating that
the correct length products were isolated in each reaction.

OE-PCR was performed in a 50 µl reaction volume containing the two aforementioned PCR products without primers
for the first 5 cycles and the products with primers (B1 primers in Table S2) for the later 30 cycles. A single band with
correct size of 1000 bps was obtained (see Fig. S4).
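Overlap extension relies on the two PCR products sharing a common stretch of sequence at their junction. As a rough check, the sketch below computes that shared stretch from the two internal B1 primers quoted above. The helper names are hypothetical, and the check assumes that the reverse primer is written 5' to 3' on the opposite strand, so that its reverse complement gives the 3' end of the first product on the strand shown for B1-M.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(COMPLEMENT)[::-1]

def suffix_prefix_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0

su1_reverse = "CGTGCACTCATAACCCATATTTCAAGAGCTAGCTATTCCTCTCCCTTAAAAGTAAATGAC"
sd1_forward = "GGGAGAGGAATAGCTAGCTCTTGAAATATGGGTTATGAGTGCACGATCATCACATAAC"

upstream_end = reverse_complement(su1_reverse)   # 3' end of the first PCR product
overlap = suffix_prefix_overlap(upstream_end, sd1_forward)
print(overlap)                  # 45
print(sd1_forward[:overlap])    # shared junction sequence
```

With the primers as listed, the two products share 45 bases at the junction, and this shared stretch lets them prime each other during the first, primer-free cycles.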


5.2 B2 mutation B2-M synthesis 

The unedited B2_original (B2) sequence is of the form: 


AACCTAACCATCTTCCTCTCGATTTGGAGCAGATTGGTATTATTCTAGTC 

GTCGAGACTAGTCAACTGCGCTAGTTTGTGTTCATAAAATAAGAGTATGA 

GATACAAGCTGATATGGGAACTTAATTACGAAGCACAGTGTTGCTGCGTG 

GACTTGTGAAGTAGGGTGTGAGATAAGAATGATAGCGAACGCAGCGTATG 

GCTGAAGTGCTGGGCATATTGTGGTGTGGACATCTCAAAGTCTATGAAGA 

TTGGTAATAGGATGGTCTCTCGGGTCTCAAACTTCGTCAGGCAGCATTGT 

GCATGCGAGTGATTGAAAGGGAGGGTAAGGGTTATTAATAGAAAAGACTT 

ACAGGCGTTGGTATGATTCAAGATCGCAAGAATCGTGTGAGCTTGAGGAC 

TAAATAGTTTAAAGAAATAGGAATAGTTGTAATTTAAGGAGCGTGGCACG 

GATGGATCAGCGTGTCAACGGAACGCGCATTTGGGAGTTTTATGTTAAGT 

GAGCAGACTAAGGTGAAATTCAATAGTCTCTATCGTTCGAGGGTTATTGC 

TAGGGGAGACTTTGAGTGAGTGGTAATTTTGAAGCAGTATACGTAACTTT 

TTCGATTCTTAGTGGCAGTTACTCTGAATTTTAGTGTGAGCAGAGTGTGA 

TAAATAGAGAGATACGAGGTCGACACGGCTGTTGGGGGCACTTAACAGTA 

GGGGGTTGATGCTGGCGGACACTAAAGGATTTTTGAAGGGGATTGTTGGC 

GACTCACATCTAAGTGGTATTGCGGGCTCTATGAGAATCTGCTCGAGTCA 

TCTAGGTTGAGGAAGAGGGGGAGATTCTCGTTAAAGACAGTACATATTTC 

GCATACTTCTTAACGTGGAGTATGAATGTCAATGGTGGGAGATATGGGTG 

GAGGGATTTCATTCACTGCATATGTACGCTCAGGAGCGCGAACGAATCAT 

AAAACTATTGTAATATATTGATAGATAAAGAAACGATCCCCTGACAGAGC 
Fig. S4. A schematic depiction of the process of generating the B2 mutation (B2 -> B2M) using standard 60 bps primers.


The edited B2_mutation (B2_M) sequence is: 

AACCTAACCATCTTCCTCTCGATTTGGAGCAGATTGGTATTATTCTAGTC 

GTCGAGACTAGTCAACTGCGCTGACGGTTATGGAATTAGGGTTGAGATGG 

GATACAAGCTGATATGGGAACTTAATTACGAAGCACAGTGTTGCTGCGTG 

GACTTGTGAAGTAGGGTGTGAGATAAGAATGATAGCGAACGCAGCGTATG 

GCTGAAGTGCTGGGCATATTGTGGTGTGGACATCTCAAAGTCTATGAAGA 

TTGGTAATAGGATGGTCTCTCGGGTCTCAAACTTCGTCAGGCAGCATTGT 

GCATGCGAGTGATTGAAAGGGAGGGTAAGGGTTATTAATAGAAAAGACTT 

ACAGGCGTTGGTATGATTCAAGATCGCAAGAATCGTGTGAGCTTGAGGAC 

TAAATAGTTTAAAGAAATAGGAATAGTTGTAATTTAAGGAGCGTGGCACG 

GATGGATCAGCGTGTCAACGGAACGCGCATTTGGGAGTTTTATGTTAAGT 

GAGCAGACTAAGGTGAAATTCAATAGTCTCTATCGTTCGAGGGTTATTGC 

TAGGGGAGACTTTGAGTGAGTGGTAATTTTGAAGCAGTATACGTAACTTT 

TTCGATTCTTAGTGGCAGTTACTCTGAATTTTAGTGTGAGCAGAGTGTGA 

TAAATAGAGAGATACGAGGTCGACACGGCTGTTGGGGGCACTTAACAGTA 

GGGGGTTGATGCTGGCGGACACTAAAGGATTTTTGAAGGGGATTGTTGGC 

GACTCACATCTAAGTGGTATTGCGGGCTCTATGAGAATCTGCTCGAGTCA 

TCTAGGTTGAGGAAGAGGGGGAGATTCTCGTTAAAGACAGTACATATTTC 

GCATACTTCTTAACGTGGAGTATGAATGTCAATGGTGGGAGATATGGGTG 

GAGGGATTTCATTCACTGCATATGTACGCTCAGGAGCGCGAACGAATCAT 

AAAACTATTGTAATATATTGATAGATAAAGAAACGATCCCCTGACAGAGC 

where, as before, red letters were used to indicate the rewritten region. 


5.2.1 The gBlock method 

A 177 bps sequence, containing the entire edited region and the B2 string, was gBlock synthesized by IDT. Another part 
of B2 was PCR amplified using the forward primer 

5’GAAGCACAGTGTTGCTGCGTG3’ 


and reverse primer 


5’AAACGATCCCCTGACAGAGC3’ 


The B2 sequence served as a template. See Fig. S4 for an illustration.

Fig. S5. PCR products of B1 and B2.

Fig. S6. PCR products of B3 (panel labels: B3-gBlock, B3-MIDDLE, B3-DOWN).


5.2.2 The OE-PCR based method 

Overlap extension PCR (OE-PCR) was performed in a 50 µl reaction volume containing the above 177 bps gBlock product
and PCR products, without primers for the first 5 cycles and with the B2 forward and reverse primers listed in Table S2
for the subsequent 30 cycles.

The PCR product was deposited on a gel substrate and the correct 1000 bps band was obtained, as shown in Fig. S5.

One pair of primers was designed to PCR amplify the first part of the sequence B2-M, with forward primer

5’AACCTAACCATCTTCCTCTC3’

and reverse primer 

5’CAGCTTGTATCCCATCTCAACCCTAATTCCATAACCGTCAGCGCAGTTGACTAGTCTC3’. 

The second part was PCR amplified by the forward primer 

5’CTGCGCTGACGGTTATGGAATTAGGGTTGAGATGGGATACAAGCTGATATGGGAAC3’ 


and reverse primer 


5’AAACGATCCCCTGACAGAGC3’.


Both PCRs used B2 as a template. The two PCR products are shown in Fig. S5.
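
The two inner primers listed above must share a common stretch of sequence for the two PCR products to anneal to each other during OE-PCR. The following minimal Python sketch checks this for the inner reverse and inner forward primers of B2-M. The helper names `reverse_complement` and `suffix_prefix_overlap` are illustrative assumptions and are not part of the original protocol; the primer strings are copied from the text.

```python
# Inner primers for B2-M, copied from the text above (written 5' to 3').
INNER_REVERSE = "CAGCTTGTATCCCATCTCAACCCTAATTCCATAACCGTCAGCGCAGTTGACTAGTCTC"
INNER_FORWARD = "CTGCGCTGACGGTTATGGAATTAGGGTTGAGATGGGATACAAGCTGATATGGGAAC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(_COMPLEMENT)[::-1]


def suffix_prefix_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


# The reverse primer anneals to the top strand, so its reverse complement
# gives the top-strand sequence at the 3' end of the first PCR product.
first_product_end = reverse_complement(INNER_REVERSE)

# The second PCR product starts with the inner forward primer.
overlap = suffix_prefix_overlap(first_product_end, INNER_FORWARD)

print("overlap length:", overlap)
print("shared sequence:", INNER_FORWARD[:overlap])
```

For the primers above the sketch reports a shared stretch of 46 bases, which is the region through which the two products can prime each other in the primer-free cycles.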

Fig. S7. Scheme for generating the B3 edits (B3 -> B3-M) using standard 60 bps primers.


5.3 B3 mutation B3-M synthesis 

The original, unedited B3 sequence is:

ATAATAGGCCTGATGATCTCGATGGATGCGCGTCACTCGAGTGCGGTAGG 

CACGTCTCAGGTGATAAGTGATTGTGATTGTAGGTGAAGGGGGTAGAAAT 

GATTGAGGAAACTTGTGTACTCGTTACACGTGATAGGGTTTGATCGGCGG 

TGGAAAAATTAGGGATGGGGATAAGATTATGGGATCGTTCTCAATAATTG 

TTACGATATCGTTGTTACACAGTTGTTACGCTACGACGTCATCGATAAAG 

GTGGGTATGTGGGGGTACTATACTCTTGGGGGCGTACAAGAGCGATGGTT 

GGTCGGATTGAAATTAAAAGCATTAAGAGGTTAATTTATAGATGCGAGGC 

GAAAGATGTGAGCGCAAGTAAAGGAAACGCGAGCAAGTGATTGTTACTAA 

TTATATTAGGAGGTGATGAGGAGCGTGGTTATCTTATTGGGCGAGCTGCA 

GCGAATTCTAGATTTCTTCGAGTTACAGTCGTAGTGATGTATATAGAGTG 

GATGCGCACATTATTACATATATCGTCGAATTGGATTAGACGCAAAGAAA 

ATGCGGCATTGTAATGGGTTGTGTAAAATTGAGCGTGGTTATCTTGTCAT 

GACATAGTAAAAGTTGCTCAATTGATTGAAGCTCGATTAGGAGAAGTAAT 

TTGAAAAAAGGATAGACTAGGACTCAACGAGGAACGGGTATTTGCAACAT 

AGTATATGCGGTCTTAATCGGAGGGTAATGTTATTTGTGTGGAAGTCGCT 

GCTGGTACTCTGGGCGTTTAGGATGAATCTTCGAAACTAGGCTTTGTCAG 

AGATAGTTTGTTGGTAAGAAGAATCAGGAAACGGTAACAGAGAATAAATG 

AATTAACGTAGCAAGATTTCGTCTTTCTGGAGATGAGAAGGTGTAGTTGA 

GGAGTCGACGTTCTTTACGGAGGTGGGAGATTGGTTTTGGCAGTACTTCG 

TTAAATACACTAAAAAATTTGATAATGTAGAAGAAGAACCAGTAAGCAGC 

The edited B3_mutation (B3_M) sequence is:


ATAATAGGCCTGATGATCTCGATGGATGCGCGTCACTCGAGTGCGGTAGG 

CACGTCTCAGGTGATAAGTGATTGTGATTGTAGGTGAAGGGGGTAGAAAT 

GATTGAGGAAACTTGTGTACTCGTTACACGTGATAGGGTTTGATCGGCGG 

TGGAAAAATTAGGGATGGGGATAAGATTATGGGATCGTTCTCAATAATTG 

TTACGATATCGTTGTTACACAGTTGTTACGCTACGACGTCATCGATAAAG 

GTGGGTATGTGGGGGTACTATACTCTTGGGGGCGTACAAGAGCGATGGTG 

TGTACACAGTTCAAGCTTAGATTGAGAGTGAGTAGATGTTGATGCGAGGC 

GAAAGATGTGAGCGCAAGTAAAGGAAACGCGAGCAAGTGATTGTTACTAA 

TTATATTAGGAGGTGATGAGGAGCGTGGTTATCTTATTGGGCGAGCTGCA 

GCGAATTCTAGATTTCTTCGAGTTACAGTCGTAGTGATGTATATAGAGTG 

GATGCGCACATTATTACATATATCGTCGAATTGGATTAGACGCAAAGAAA 

ATGCGGCATTGTAATGGGTTGTGTAAAATTGAGCGTGGTTATCTTGTCAT 

GACATAGTAAAAGTTGCTCAATTGATTGAAGCTCGATTAGGAGAAGTAAT 

TTGAAAAAAGGATAGACTAGGACTCAACGAGGAACGGGTATTTGCAACAT 

AGTATATGCGGCTTGATCTAGCATTAATGGATTATAGGGGGGAAGTCGCT 

GCTGGTACTCTGGGCGTTTAGGATGAATCTTCGAAACTAGGCTTTGTCAG 

AGATAGTTTGTTGGTAAGAAGAATCAGGAAACGGTAACAGAGAATAAATG 

AATTAACGTAGCAAGATTTCGTCTTTCTGGAGATGAGAAGGTGTAGTTGA 

GGAGTCGACGTTCTTTACGGAGGTGGGAGATTGGTTTTGGCAGTACTTCG 

TTAAATACACTAAAAAATTTGATAATGTAGAAGAAGAACCAGTAAGCAGC 
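
The rewritten regions can also be located mechanically by comparing the two listings position by position. The sketch below does this in Python for the fifteenth 50-base row of the B3 and B3_M listings; the same comparison applies to the other rows. The helper `changed_window` is a hypothetical name introduced here for illustration. It reports the smallest window that contains every mismatch, which may be slightly narrower than the region that was actually substituted if the old and new bases happen to agree at its edges.

```python
# Fifteenth 50-base row of the B3 and B3_M listings, copied from the text.
B3_ROW = "AGTATATGCGGTCTTAATCGGAGGGTAATGTTATTTGTGTGGAAGTCGCT"
B3_M_ROW = "AGTATATGCGGCTTGATCTAGCATTAATGGATTATAGGGGGGAAGTCGCT"


def changed_window(original: str, edited: str):
    """Return (start, end) of the smallest half-open window containing all
    mismatches between two equal-length strings, or None if they are equal."""
    if len(original) != len(edited):
        raise ValueError("sequences must have equal length")
    mismatches = [i for i, (a, b) in enumerate(zip(original, edited)) if a != b]
    if not mismatches:
        return None
    return mismatches[0], mismatches[-1] + 1


window = changed_window(B3_ROW, B3_M_ROW)
start, end = window
print("changed window (0-based, within the row):", window)
print("original:", B3_ROW[start:end])
print("edited:  ", B3_M_ROW[start:end])
```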

5.3.1 The gBlock method

Two sequences, the 560 bps sequence containing the first mutation region and the second 560 bps sequence containing the 
second mutation region, were gBlock synthesized by IDT. There was a 60 bps overlap between the two gBlocks. 

5.3.2 The OE-PCR method 

OE-PCR was performed in a 50 µl reaction volume containing the above two 560 bps gBlock products, without primers
for the first 5 cycles and with the additional B3 forward and reverse primers listed in Table S2 for the subsequent
30 cycles. The PCR product was deposited on a gel substrate and the correct 1000 bps band was obtained.

One pair of primers was designed to PCR amplify the first part of the sequence B3-M, using

5’ATAATAGGCCTGATGATCTC3’


in the forward direction and 

5’AACATCTACTCACTCTCAATCTAAGCTTGAACTGTGTACACACCATCGCTCTTGTACGCC3’ 


in the reverse direction. 

The second part was PCR amplified in the forward direction by using the primer

5’GTGTACACAGTTCAAGCTTAGATTGAGAGTGAGTAGATGTTGATGCGAGGCGAAAGATGT3’

and in the reverse direction by using the primer

5’GACTTCCCCCCTATAATCCATTAATGCTAGATCAAGCCGCATATACTATGTTGCAAATAC3’.

The third part was PCR amplified by the forward direction primer

5’GCGGCTTGATCTAGCATTAATGGATTATAGGGGGGAAGTCGCTGCTGGTACTCTG3’

and reverse direction primer

5’AAGAAGAACCAGTAAGCAGC3’.

All three PCRs used the sequence B3 as the template. All three PCR products are shown in Fig. S8.

OE-PCR was performed in a 50 µl reaction volume containing the above three PCR products, without primers for the
first 5 cycles and with the B3 primers listed in Table S2 for the subsequent 30 cycles. A single band of the correct
size, 1000 bps, was obtained (see Fig. S9).

Fig. S8. The generated PCR products of 1000 bps edits from the gBlock method, involving B1-gBlock, B2-gBlock and
B3-gBlock (panel labels: B1-M-gBlock, B2-M-gBlock, B3-M-gBlock).

Fig. S9. The generated PCR products of 1000 bps sequence editing for the OE-PCR based method, involving sequences
B1-PCR, B2-PCR and B3-PCR.


Correctness of the synthesized edited regions was confirmed via Sanger sequencing as follows. The PCR products
of the gBlock method and the OE-PCR method were named B1-M-gBlock, B2-M-gBlock, B3-M-gBlock and B1-M-PCR,
B2-M-PCR, B3-M-PCR, respectively. All final edited PCR products were purified using the Qiagen Gel
Purification Kit. The purified 1000 bps edited sequences were blunt-ligated to the vector pCR™-Blunt (Fig. S10)
using the Zero Blunt PCR Cloning Kit, following the manufacturer’s protocol. Five colonies of each
PCR-Blunt-mutation were sent to ACTG, Inc. Sequencing was performed using two universal primers: M13F_20 (for the
reverse direction) and M13R (for the forward direction). Bi-directional sequencing was performed in order to ensure
that the entire 1000 bps block was completely covered.

Fig. S10. Map and features of the PCR-Blunt vector (Life Technologies).


6 Hybrid DNA-Based and Classical Storage 

In our small-scale experiments, Sanger sequencing produced two erroneous symbols in one strand, which we were able
to correct using prefix matching. One possible problem that may arise in large-scale DNA-storage systems involving
millions of blocks is sequencing errors that cannot be corrected via prefix matching. In current High Throughput
Sequencing technologies, such as Illumina HiSeq or MiSeq, the dominant sources of errors are substitutions. Due to
our word grouping scheme, such substitution errors cannot cause catastrophic error propagation, but may nevertheless
accumulate as the number of rewrite cycles increases. In this case, prefix matching may not suffice to correct the errors
and more sophisticated coding schemes need to be used. Unfortunately, adding parity-check symbols to the
prefix-encoded data stream may cause problems, as the parities may violate the prefix properties and unbalance the
GC content. Furthermore, every time rewriting is performed, the parity-checks need to be updated, which incurs
additional cost for maintaining the system. A simple solution to this problem is a hybrid scheme, in which the bulk of
the information is stored in DNA media, while only the parity-checks are stored on a classical device, such as flash memory.
Given that the current error-rate of short-read sequencing technologies roughly equals 1%, the most suitable codes for
performing this type of coding are low-density parity-check codes. These codes offer excellent performance in the
presence of a large number of errors and are decodable in linear time.
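
The hybrid idea can be illustrated with a toy example in Python. The sketch below is not an LDPC code and includes no decoder: it only computes a small set of sparse parity checks over a DNA block, keeps them outside the DNA (standing in for the classical device), and uses them to flag a substitution after sequencing. The two-bit base mapping, the check layout in `sparse_checks`, and all function names are assumptions made for illustration; the data string is the first 50-base row of B3.

```python
# Assumed 2-bit mapping of bases, used only for this illustration; it is not
# the prefix-synchronized encoding described in the paper.
BASE_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def to_bits(seq):
    """Flatten a DNA string into a list of bits, two bits per base."""
    return [bit for base in seq for bit in BASE_BITS[base]]


def sparse_checks(n_bits, m):
    """Toy sparse check layout: bit i joins check i % m and check m + i % (m + 1).

    Every bit takes part in exactly two checks, and the two bits of a base
    always fall into different checks, so any single substitution flips at
    least one parity."""
    checks = [[] for _ in range(2 * m + 1)]
    for i in range(n_bits):
        checks[i % m].append(i)
        checks[m + i % (m + 1)].append(i)
    return checks


def parities(seq, checks):
    """Parity (sum modulo 2) of the bits covered by each check."""
    bits = to_bits(seq)
    return [sum(bits[i] for i in row) % 2 for row in checks]


def failed_checks(read, checks, stored):
    """Indices of checks whose parity on `read` disagrees with the stored one."""
    return [j for j, (p, q) in enumerate(zip(parities(read, checks), stored)) if p != q]


block = "ATAATAGGCCTGATGATCTCGATGGATGCGCGTCACTCGAGTGCGGTAGG"  # first row of B3
checks = sparse_checks(2 * len(block), m=10)

# Parities are kept off-DNA (for example in flash); the DNA block is unchanged.
stored = parities(block, checks)

# Simulate one substitution error (G -> A at index 20) introduced by sequencing.
noisy = block[:20] + "A" + block[21:]

print("clean read :", failed_checks(block, checks, stored))
print("noisy read :", failed_checks(noisy, checks, stored))
```

Because the parities live outside the DNA, rewriting a block only requires recomputing `stored` for that block; the prefix properties and GC content of the DNA strings are not affected. A practical system would replace the toy layout with a properly designed LDPC code and an iterative decoder.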